Communications system using rings architecture

ABSTRACT

Systems and methods are provided for implementing: a rings architecture for communications and data handling systems; an enumeration process for automatically configuring the ring topology; automatic routing of messages through bridges; extending a ring topology to external devices; write-ahead functionality to promote efficiency; wait-till-reset operation resumption; in-vivo scan through rings topology; staggered clocking arrangement; and stray message detection and eradication. Other inventive elements conveyed include: an architectural overview of a packet processor; a programming model for a packet processor; an instruction pipeline for a packet processor; and use of a packet processor as a module on a rings-based architecture. Additional inventive elements conveyed include: an architectural overview of a communications processor; a data path protocol support model for a communications processor; an exemplary network processor employed as the core packet processor for the communications processor; an exemplary rings-based SOC switch fabric architecture; and a variety of quality of service features.

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] Priority is claimed based on U.S. Provisional Application No. 60/301,843 entitled Communication System Using Rings Architecture, filed Jul. 2, 2001, U.S. Provisional Application No. 60/333,516 entitled Flexible Packet Processor For Use in Communications System, filed Nov. 28, 2001, and U.S. Provisional Application No. 60/347,235 entitled High Performance Communications Processor Supporting Multiple Communications Applications, filed Jan. 14, 2002.

BACKGROUND OF THE INVENTION

[0002] The present invention relates generally to data communication networks and, more particularly, to receiving and transmitting systems, including ATM and other types of communications platforms and including such components as communications processors, packet processors, network processors, DMAs, FPGAs and other devices and peripheral devices.

[0003] The number of business and private home users of computers continues to grow rapidly, with these users typically being connected to local area networks (LANs), wide area networks (WANs), intranets, extranets, digital subscriber line (DSL) networks, etc. With growing demand from such users for increasingly large amounts of data across such networks, bandwidth and data processing and handling speed are ever-present concerns facing service and equipment providers to this vast audience of users. Hubs, routers, modems and switches have been the predominant mechanisms for providing the interconnectivity for many users to access networks. Switches made up of expensive VLSI (very large scale integration) circuits are often used to build out networks. In addition to the drawbacks presented by the expense of implementing such circuits, clock synchronization is of continuing concern in switched networks.

[0004] With the proliferation of the digital age, a significant demand has arisen for versatile networking technology capable of efficiently transmitting multiple types of information at high speeds across different network environments. One increasingly popular platform is Asynchronous Transfer Mode, commonly referred to as ATM, which was developed by the International Telegraph and Telephone Consultative Committee (CCITT), and its successor organization, the Telecommunications Standardization Sector of the International Telecommunication Union (ITU-T). ATM is a technology capable of high speed transfer of voice, video, and other types of data across public and private networks. Although widely implemented, ATM is just one example of many platforms used in handling communications and data across networks.

[0005] ATM utilizes very large-scale integration (VLSI) technology to segment data into individual packets (also referred to as cells). For example, B-ISDN calls for packets having a fixed size of fifty-three bytes (i.e., octets). Using the B-ISDN 53-byte packet for purposes of illustration, each ATM cell includes a header portion comprising the first five bytes and a payload portion comprising the remaining forty-eight bytes. ATM cells are routed across the various networks by passing through ATM switches, which read addressing information included in the cell header and deliver the cell to the destination referenced therein. Unlike other types of networking protocols, ATM does not rely upon Time Division Multiplexing (TDM) to establish the identification of each cell. Rather, ATM cells are identified solely based upon information contained within the cell header.
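
For illustration only, the 53-byte cell layout described above can be sketched in Python as follows. The function names are hypothetical. The bit-level layout of the five header bytes (GFC, VPI, VCI, PT, CLP, HEC) is the standard UNI cell format and is an assumption here, since this description states only that the header carries addressing information.

```python
CELL_SIZE = 53                              # fixed B-ISDN cell size in bytes
HEADER_SIZE = 5                             # first five bytes form the header
PAYLOAD_SIZE = CELL_SIZE - HEADER_SIZE      # remaining forty-eight bytes


def split_cell(cell: bytes):
    """Split one ATM cell into its (header, payload) portions."""
    if len(cell) != CELL_SIZE:
        raise ValueError(f"expected {CELL_SIZE} bytes, got {len(cell)}")
    return cell[:HEADER_SIZE], cell[HEADER_SIZE:]


def parse_uni_header(header: bytes) -> dict:
    """Decode header fields, assuming the standard UNI bit layout."""
    if len(header) != HEADER_SIZE:
        raise ValueError(f"expected {HEADER_SIZE} header bytes")
    b0, b1, b2, b3, b4 = header
    return {
        "gfc": b0 >> 4,                                       # generic flow control
        "vpi": ((b0 & 0x0F) << 4) | (b1 >> 4),                # virtual path id
        "vci": ((b1 & 0x0F) << 12) | (b2 << 4) | (b3 >> 4),   # virtual channel id
        "pt": (b3 >> 1) & 0x07,                               # payload type
        "clp": b3 & 0x01,                                     # cell loss priority
        "hec": b4,                                            # header error control
    }


# Hypothetical cell carrying VPI 1 and VCI 32 with an all-zero payload.
example_cell = bytes([0x00, 0x10, 0x02, 0x00, 0x00]) + bytes(PAYLOAD_SIZE)
example_header, example_payload = split_cell(example_cell)
example_fields = parse_uni_header(example_header)
```

In this sketch a switch would need only the header fields to identify the cell, which mirrors the point that cells are identified by header content and not by time position.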

[0006] Further, ATM differs from systems based upon conventional network architectures such as Ethernet or Token Ring in that rather than broadcasting data packets on a shared wire for all network members to receive, ATM cells dictate the successive recipient of the cell through information contained within the cell header. A specific routing path through the network, called a virtual path (VP) or virtual circuit (VC), is set up between two end nodes before any data is transmitted. Cells identified with a particular virtual circuit are delivered to only those nodes on that virtual circuit. In this manner, only the destination identified in the cell header receives the transmitted cell.

[0007] The cell header includes, among other information, addressing information that essentially describes the source of the cell or where the cell is coming from and its assigned destination. Although ATM evolved from TDM concepts, cells from multiple sources are statistically multiplexed into a single transmission facility. Cells are identified by the contents of their headers rather than by their time position in the multiplexed stream. A single ATM transmission facility may carry hundreds of thousands of ATM cells per second originating from a multiplicity of sources and traveling to a multiplicity of destinations.

[0008] The backbone of an ATM network generally consists of switching devices capable of handling the high-speed ATM cell streams. The switching components of these devices, commonly referred to as the switch fabric, perform the switching function required to implement a virtual circuit by receiving ATM cells from an input port, analyzing the information in the header of the incoming cells in real-time, and routing them to the appropriate destination port. Millions of cells per second often need to be switched by a single device.
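
As a minimal sketch of the switching function just described, the following Python models a switch fabric as a lookup from header information to a destination port. The class name, the method names, and the choice of (input port, VPI, VCI) as the lookup key are assumptions made for illustration; they are not taken from this description.

```python
from typing import Dict, Optional, Tuple

# Hypothetical lookup key: (input port, VPI, VCI) read from the cell header.
CircuitKey = Tuple[int, int, int]


class SwitchFabricSketch:
    """Toy model: a virtual circuit is a table entry set up before data flows."""

    def __init__(self) -> None:
        self._table: Dict[CircuitKey, int] = {}

    def set_up_circuit(self, in_port: int, vpi: int, vci: int, out_port: int) -> None:
        """Record the destination port for cells of one virtual circuit."""
        self._table[(in_port, vpi, vci)] = out_port

    def route(self, in_port: int, vpi: int, vci: int) -> Optional[int]:
        """Return the destination port, or None if no circuit was set up."""
        return self._table.get((in_port, vpi, vci))


fabric = SwitchFabricSketch()
fabric.set_up_circuit(in_port=0, vpi=1, vci=32, out_port=3)
known_port = fabric.route(0, 1, 32)      # circuit exists
unknown_port = fabric.route(0, 1, 33)    # no circuit set up for this header
```

A real fabric performs this lookup in hardware at cell rate; the dictionary here only illustrates that the routing decision depends on header contents established at connection set-up.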

[0009] This connection-oriented scheme permits an ATM network to guarantee the minimum amount of bandwidth required by each connection. Such guarantees are made when the connection is set up. When a connection is requested, an analysis of existing connections is performed to determine if enough total bandwidth remains within the network to service the new connection at its requested capacity. If the necessary bandwidth is not available, the connection is refused.
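
The admission decision described above can be sketched as follows. This is a simplified illustration that assumes a single link with a fixed capacity and a simple sum of reserved bandwidth; the class name, method names, and units are hypothetical.

```python
class LinkAdmissionSketch:
    """Toy admission control: accept a connection only if bandwidth remains."""

    def __init__(self, capacity_bps: int) -> None:
        self.capacity_bps = capacity_bps
        self.reserved_bps = 0   # total bandwidth of existing connections

    def request(self, requested_bps: int) -> bool:
        """Reserve bandwidth and return True, or refuse and return False."""
        if requested_bps <= 0:
            raise ValueError("requested bandwidth must be positive")
        if self.reserved_bps + requested_bps > self.capacity_bps:
            return False        # not enough bandwidth left: refuse
        self.reserved_bps += requested_bps
        return True

    def release(self, released_bps: int) -> None:
        """Return bandwidth when a connection is torn down."""
        self.reserved_bps = max(0, self.reserved_bps - released_bps)


link = LinkAdmissionSketch(capacity_bps=100)
first_accepted = link.request(60)    # fits within capacity
second_accepted = link.request(50)   # would exceed capacity, so refused
third_accepted = link.request(40)    # exactly fills the remaining capacity
```

Because the check happens once at set-up, an accepted connection keeps its guarantee for its lifetime in this model.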

[0010] The design of conventional ATM switching systems involves a compromise between which operations should be performed in hardware and which in software. Generally, but not without exception, hardware gives optimal performance but reduces flexibility, while software allows greater flexibility and control over scheduling and buffering and makes it practical to have more sophisticated cell processing (e.g., OAM cell extraction, etc.).

[0011] The various protocols associated with platforms such as ATM, Ethernet and others are distinct and require special handling, which is essentially transparent to the user. One approach to packaging the hardware and software necessary to handle the protocol processing and general communications and data processing is system on a chip (SOC), which typically is made up of several modules, often dedicated to specific tasks, working together. A number of these modules typically are interfaces to the external environment, such as Ethernet or Utopia. Other modules can include processors or memories. To illustrate, FIG. 1 shows a typical SOC 10, such as a communications processor, having a variety of modules, such as CPUs 14, 22, RAM 16, Ethernet interface 18, I/O interface 20, and DMA 24, interconnected via a switch fabric 12.

[0012] The challenge currently faced by system designers is to integrate the modules into a cohesive system. The usual approach is to define busses, connect the modules on the busses, run signals between the modules via the busses, add bridges to connect busses, and so on. Other challenges to designing a SOC, among others, include: heterogeneous peripheral devices; several active modules (CPU, DMA); performance bottlenecks; performance organization of connectivity and busses; customer reality changes over the life of a project; design verification bottleneck, both intra-module and inter-module; and application verification. As demonstrated, these challenges result in a considerable number of mechanisms needing to be debugged during the design of a SOC.

[0013] Although the traditional bus oriented approach is extensively utilized, such an approach typically has the following problems: a number of interfaces to debug for both timing and logic; architectural decisions typically need to be made early in the design; busses often create unpredictable timing and loadings; changing anything, such as adding a peripheral or deleting a CPU, requires considerable revamping of the system; and so on.

[0014] A communications processor is one example of a communications system commonly designed using the traditional bus approach. A robust SOC communications processor may find a myriad of applications, such as for modems, bridges, routers, gateways, multi-service gateways and access equipment, and so forth. Such a communications processor may be PHY [Physical layer]-independent, in which case it will be coupled with an appropriate PHY product, or it may be PHY-integrated, in order to provide the connectivity to the PHY layer of the ATM (or OSI [Open Systems Interconnection]) layered protocol model. It can be readily appreciated that if such a SOC communications processor is to be robust in terms of the applications it can support, it must be able to process a wide variety of different protocols, such as ATM, FR (Frame Relay), IP (Internet Protocol), TDM, and so forth. Therefore, in such a SOC communications processor, a packet processor for processing the packets of information that may be of a variety of protocols may be implemented.

[0015] The processing of packets or cells performed by the packet processor may include the following tasks: packet header analysis (OSI Layer 2, Layer 3); frame validity (CRC [Cyclic Redundancy Code] check); forwarding decision (lookup); header modification/conversion; segmentation and reassembly; data conversion (e.g., encryption); statistics gathering; and so on. In fact, as bandwidth requirements go up, and the demand for wire speed packet processing exists, packet processors have to be optimized to solve packet processing specific tasks. Proposed solutions for packet processing that exist today range from hard wired ASICs (Application Specific Integrated Circuits) (typically inflexible) to programmable packet processors (more flexible).
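
To make one of the listed tasks concrete, the following Python sketch shows a frame validity check based on a CRC. It uses the generic CRC-32 from Python's standard `zlib` module with a 4-byte big-endian trailer; both the polynomial and the trailer layout are assumptions chosen for illustration and do not correspond to any specific protocol recited here. The function names are hypothetical.

```python
import zlib

CRC_LEN = 4  # assumed trailer length in bytes


def append_crc(payload: bytes) -> bytes:
    """Build a frame by appending the CRC-32 of the payload as a trailer."""
    return payload + zlib.crc32(payload).to_bytes(CRC_LEN, "big")


def frame_is_valid(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailer."""
    if len(frame) < CRC_LEN:
        return False
    payload, trailer = frame[:-CRC_LEN], frame[-CRC_LEN:]
    return zlib.crc32(payload).to_bytes(CRC_LEN, "big") == trailer


good_frame = append_crc(b"example payload")
# Flip one bit in the first byte to simulate a transmission error.
bad_frame = bytes([good_frame[0] ^ 0x01]) + good_frame[1:]
good_ok = frame_is_valid(good_frame)
bad_ok = frame_is_valid(bad_frame)
```

In a packet processor this check is typically a candidate for a dedicated hardware unit, since it touches every byte of every frame.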

[0016] In the last few years, there has been a need for programmable packet processors for communication systems. The major advantages of programmable solutions can include: flexible adjustment for rapidly changing communication standards; implementation of increasingly complex communications difficult to implement in an ASIC; and consideration of differentiation and Time To Market (TTM) as a crucial aspect in today's competitive environment.

[0017] From the system vendor's vantage, programmable packet processors generally have an advantage over ASIC solutions. A programmable packet processor can be viewed as a platform to be quickly deployed (in consideration of TTM), after which one can add or modify system functionality by changing or adding code to the packet processor. The trade-off that system vendors would face with very high end solutions (core rate OC [Optical Carrier]-48, OC-192, for example) would be power and performance in programmable packet processors as compared to fixed ASIC solutions. However, several companies have announced programmable solutions for such core rates, indicating that a programmable solution is needed by vendors for such core rate products.

[0018] A programmable packet processor (also referred to as a network processor) would preferably provide a solution in the access space where the expected aggregate bandwidth is in the range of OC-3 to OC-12. Of course, the access market requirements are different from those of the network edge and the core. At the access points, systems would need to deal with lots of subscribers (ports), low speed links (T1, xDSL [x Digital Subscriber Line]) and with different access methods (ATM, IP, FR, TDM, etc.), whereas the edge and the core of the network generally would use one framing solution (MPLS, IP or ATM). Access systems, in this case, typically would be characterized by: a large number of subscribers (ports, flows), high density; requirements for Inter Working Functions (IWFs), such as voice (TDM) to packets (ATM or IP) (e.g., Voice gateways), MAN (Metropolitan Area Network) to WAN (Wide Area Network), Ethernet to ATM or PoS [Packet Over SONET]; data grooms (asymmetric behavior, large pipe to many small pipes); and the like. Accordingly, access systems need lots of packet manipulation, especially on media conversions and IWF. Therefore, a programmable (and therefore flexible) packet processor often is a preferred solution.

[0019] Such a programmable packet processor could be developed using a standard general purpose microprocessor core. Several processor cores are commercially available, including those that are licensed by Advanced RISC Machines, Ltd., ARC International, MIPS Computer Systems, Inc., and Lexra, Inc. However, the above cores are general purpose cores that would need to be optimized for packet processing. Such optimization typically would include: additional instructions; DMA support; task switch with low overhead; specific bit manipulation instructions; etc. The disadvantages of using such general purpose cores in packet processing applications include: costs incurred from license fees and royalties; limited customization (a special license is usually required to modify the core); dependency on the core provider's roadmap and technical support; unneeded features, such as FPUs (Floating Point Units) and MMUs (Memory Management Units); etc.

[0020] Therefore, there is a need for a highly robust programmable packet processor that can support a variety of high end applications, that is capable of handling a variety of protocols, and that provides desired performance in terms of speed and power.

[0021] What is also needed is a high performance communications processor implementing such a programmable packet processor as its core network processor(s), and implementing other useful modules, such as memories, DMAs, and interfaces to outside PHY platforms, so that the high performance communications processor can be beneficially implemented as a SOC solution for a myriad of high end communication applications.

SUMMARY OF THE INVENTION

[0022] The present invention overcomes the problems noted above and provides a number of additional advantages over prior systems.

[0023] The following description is intended to convey a thorough understanding of the inventive aspects by providing a number of specific embodiments and details including, among other things: rings architecture for communications and data handling systems, enumeration process for automatically configuring the ring topology, automatic routing of messages through bridges, automatic routing of exception messages, extending a ring topology to external devices and providing a flexible and re-configurable system, read return address, write-ahead functionality to promote efficiency, wait-till-reset operation resumption, in-vivo scan through rings topology, staggered clocking arrangement, and stray message detection and eradication.

[0024] Other inventive elements conveyed through the embodiments and details discussed below include, among other things: an architectural overview of a flexible packet processor; a programming model for a flexible packet processor; an instruction pipeline for a flexible packet processor; an internal memory to be used with the flexible packet processor; the use of a flexible packet processor as a module on a rings-based architecture; and the core of the flexible packet processor and associated components (agents and non-agents) on the packet processor.

[0025] Additional inventive elements conveyed through the embodiments and details discussed below include, among other things: an architectural overview of a communications processor; a programming model for a communications processor; a data path protocol support model for a communications processor; an exemplary network processor employed as the core packet processor for the communications processor; an exemplary rings-based SOC interconnect fabric architecture employed in the communications processor; a variety of quality of service (QoS) features that are implemented in the communications processor; a series of beneficial applications of the communications processor; the various approaches for the software that can be implemented to power the communications processor; specific exemplary strategies for the software in the high performance communications processor; and a performance estimate for RFC 1483 bridging.

BRIEF DESCRIPTION OF THE DRAWINGS

[0026] The present invention can be understood more completely by reading the following Detailed Description of the Invention, in conjunction with the accompanying drawings in which:

[0027] FIG. 1 is a block diagram illustrating a typical system on a chip.

[0028] FIG. 2 is a schematic diagram illustrating a ring architecture in accordance with at least one embodiment of the present invention.

[0029] FIG. 3 is a flow diagram illustrating an exemplary enumeration process in accordance with at least one embodiment of the present invention.

[0030] FIGS. 4-8 are schematic diagrams illustrating timing issues in a clocked system in accordance with at least one embodiment of the present invention.

[0031] FIG. 9 is a schematic diagram illustrating a mechanism for providing a clock signal in an opposing direction to data flow in a rings network in accordance with at least one embodiment of the present invention.

[0032] FIG. 10 is a schematic diagram illustrating a mechanism for providing a clock signal in a same direction as a data flow in a rings network in accordance with at least one embodiment of the present invention.

[0033] FIG. 11 is a schematic diagram illustrating an exemplary implementation of a timing interface of a rings interface in a rings network in accordance with at least one embodiment of the present invention.

[0034] FIG. 12 is a schematic diagram illustrating latency issues in a ring network in accordance with at least one embodiment of the present invention.

[0035] FIGS. 13 and 14 are schematic diagrams illustrating exemplary implementations of bridges in ring networks in accordance with at least one embodiment of the present invention.

[0036] FIG. 15 is a schematic diagram illustrating an exemplary enumeration process in a ring network having a bridge in accordance with at least one embodiment of the present invention.

[0037] FIG. 16 is a schematic diagram illustrating an exemplary priority scheme for messages received simultaneously at a same interface of a bridge in a ring network in accordance with at least one embodiment of the present invention.

[0038] FIG. 17 is a schematic diagram illustrating an exemplary implementation of a bridge in accordance with at least one embodiment of the present invention.

[0039] FIGS. 18 and 19 are schematic diagrams illustrating an exemplary process for the elimination of stray messages in a ring network in accordance with at least one embodiment of the present invention.

[0040] FIGS. 20-22 are schematic diagrams illustrating exemplary ring networks having multiple bridges in accordance with at least one embodiment of the present invention.

[0041] FIGS. 23-25 are schematic diagrams illustrating exemplary implementations of a scan interface in a ring network in accordance with at least one embodiment of the present invention.

[0042] FIG. 26 is a schematic diagram illustrating exemplary interface signals between two members of a ring network in accordance with at least one embodiment of the present invention.

[0043] FIGS. 27 and 28 are schematic diagrams illustrating an exemplary implementation of a ring interface in accordance with at least one embodiment of the present invention.

[0044] FIG. 29 is a flow diagram illustrating an exemplary process for determining an intended recipient of a message in a ring network in accordance with at least one embodiment of the present invention.

[0045] FIGS. 30-33 are schematic diagrams illustrating exemplary signaling within a ring interface in a ring network in accordance with at least one embodiment of the present invention.

[0046] FIG. 34 is a schematic diagram illustrating an exemplary use of bridges in a ring network to minimize latency in accordance with at least one embodiment of the present invention.

[0047] FIG. 35 is a schematic diagram illustrating an external ring interface in accordance with at least one embodiment of the present invention.

[0048] FIG. 36 is a block diagram illustrating an exemplary system on a chip utilizing a ring architecture in accordance with at least one embodiment of the present invention.

[0049] FIG. 37 is a schematic diagram illustrating the exemplary network processor of the system on a chip of FIG. 36 in accordance with at least one embodiment of the present invention.

[0050] FIG. 38 is a flow diagram illustrating a low overhead task switch in a network processor in accordance with at least one embodiment of the present invention.

[0051] FIG. 39 is a flow diagram illustrating exemplary data paths in a network processor in accordance with at least one embodiment of the present invention.

[0052] FIG. 40 is a block diagram illustrating exemplary state resources of a network processor in accordance with at least one embodiment of the present invention.

[0053] FIG. 41 is a block diagram illustrating an exemplary implementation of register r1 of a general purpose register of a network processor in accordance with at least one embodiment of the present invention.

[0054] FIG. 42 is a block diagram illustrating various registers of a general purpose register of a network processor in accordance with at least one embodiment of the present invention.

[0055] FIG. 43 is a block diagram illustrating an exemplary software model for a network processor in accordance with at least one embodiment of the present invention.

[0056] FIG. 44 is a flow diagram illustrating an exemplary network processor pipeline in accordance with at least one embodiment of the present invention.

[0057] FIG. 45 is a flow diagram illustrating an exemplary network processor pipeline timing in accordance with at least one embodiment of the present invention.

[0058] FIG. 46 is a schematic diagram illustrating an exemplary internal memory for implementation in a network processor in accordance with at least one embodiment of the present invention.

[0059] FIG. 47 is a schematic diagram of an exemplary network processor in accordance with at least one embodiment of the present invention.

[0060] FIG. 48 is a schematic diagram illustrating an exemplary multireader agent in accordance with at least one embodiment of the present invention.

[0061] FIG. 49 is a flow diagram illustrating an exemplary data alignment and packing process in accordance with at least one embodiment of the present invention.

[0062] FIG. 50 is a flow diagram illustrating a mapping of data from a multireader agent bus to a multireader operation in accordance with at least one embodiment of the present invention.

[0063] FIG. 51 is a schematic diagram illustrating an exemplary message sender of a network processor in accordance with at least one embodiment of the present invention.

[0064] FIG. 52 is a flow diagram illustrating an exemplary mapping of an agent write command to a message in accordance with at least one embodiment of the present invention.

[0065] FIG. 53 is a schematic diagram illustrating an exemplary direct memory access agent module in accordance with at least one embodiment of the present invention.

[0066] FIG. 54 is a flow diagram illustrating an exemplary mapping of data on an agent bus to a direct memory access command.

[0067] FIG. 55 is a schematic diagram illustrating an exemplary cyclical redundancy code agent in accordance with at least one embodiment of the present invention.

[0068] FIG. 56 is a flow diagram illustrating a mapping of data on an agent bus to cyclical redundancy code data in accordance with at least one embodiment of the present invention.

[0069] FIG. 57 is a schematic diagram illustrating an exemplary timer agent in accordance with at least one embodiment of the present invention.

[0070] FIG. 58 is a flow diagram illustrating a mapping of data on an agent bus to timer data in accordance with at least one embodiment of the present invention.

[0071] FIG. 59 is a schematic diagram of an exemplary doorbell agent in accordance with at least one embodiment of the present invention.

[0072] FIG. 60 is a flow diagram illustrating an exemplary encoding of task data for use by a doorbell agent in accordance with at least one embodiment of the present invention.

[0073] FIG. 61 is a block diagram illustrating an exemplary communications processor implementing a ring architecture in accordance with at least one embodiment of the present invention.

[0074] FIG. 62 is a schematic diagram illustrating the exemplary communications processor of FIG. 61 in accordance with at least one embodiment of the present invention.

[0075] FIGS. 63-69 are schematic diagrams illustrating various implementations of an external ring interface in a communications processor in accordance with at least one embodiment of the present invention.

[0076] FIG. 70 is a block diagram illustrating an exemplary programming module for a communications processor in accordance with at least one embodiment of the present invention.

[0077] FIG. 71 is a block diagram illustrating an exemplary data path and protocol path of a communications processor in accordance with at least one embodiment of the present invention.

[0078] FIG. 72 is a schematic diagram illustrating an exemplary network processor utilized in a communications processor in accordance with at least one embodiment of the present invention.

[0079] FIG. 73 is a flow diagram illustrating an exemplary processing pipeline of a network processor utilized in a communications processor in accordance with at least one embodiment of the present invention.

[0080] FIGS. 74 and 75 are flow diagrams illustrating exemplary pacing processes utilized in a communications processor in accordance with at least one embodiment of the present invention.

[0081] FIGS. 76-80 are schematic diagrams illustrating various exemplary implementations of a communications processor in communications systems in accordance with at least one embodiment of the present invention.

[0082] FIG. 81 is a flow diagram illustrating an exemplary flow manager functionality of a communications processor in accordance with at least one embodiment of the present invention.

[0083] FIG. 82 is a block diagram illustrating an exemplary data plane development for use in software development for a communications processor in accordance with at least one embodiment of the present invention.

[0084] FIG. 83 is a block diagram illustrating an exemplary software development model in accordance with at least one embodiment of the present invention.

[0085] FIG. 84 is a block diagram illustrating an exemplary software design approach in accordance with at least one embodiment of the present invention.

[0086] FIG. 85 is a block diagram illustrating an exemplary partitioning of software and interfaces in a communications processor in accordance with at least one embodiment of the present invention.

[0087] FIG. 86 is a block diagram illustrating an exemplary partitioning of software in a network processor in accordance with at least one embodiment of the present invention.

[0088] FIG. 87 is a flow diagram illustrating a typical process for executing program instructions using a known multiple-branch technique.

[0089] FIG. 88 is a schematic diagram illustrating an exemplary processing environment in accordance with at least one embodiment of the present invention.

[0090] FIG. 89 is a schematic diagram illustrating an exemplary architecture of a processing unit of the processing environment of FIG. 88 in accordance with at least one embodiment of the present invention.

[0091] FIG. 90 is a flow diagram illustrating an exemplary process for executing program instructions based on the value of an accumulative flag in accordance with at least one embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0092] The following description is intended to convey a thorough understanding of the inventive aspects by providing a number of specific embodiments and details including, among other things: rings architecture for communications and data handling systems, enumeration process for automatically configuring the ring topology, automatic routing of messages through bridges, automatic routing of exception messages, extending a ring topology to external devices and providing a flexible and re-configurable system, read return address, write-ahead functionality to promote efficiency, wait-till-reset operation resumption, in-vivo scan through rings topology, staggered clocking arrangement, and stray message detection and eradication.

[0093] Other inventive elements conveyed through the embodiments and details discussed below include, among other things: an architectural overview of a flexible packet processor; a programming model for a flexible packet processor; an instruction pipeline for a flexible packet processor; an internal memory to be used with the flexible packet processor; the use of a flexible packet processor as a module on a rings-based architecture; and the core of the flexible packet processor and associated components (agents and non-agents) on the packet processor.

[0094] Additional inventive elements conveyed through the embodiments and details discussed below include, among other things: an architectural overview of a communications processor; a programming model for a communications processor; a data path protocol support model for a communications processor; an exemplary network processor employed as the core packet processor for the communications processor; an exemplary rings-based SOC interconnect fabric architecture employed in the communications processor; a variety of quality of service (QoS) features that are implemented in the communications processor; a series of beneficial applications of the communications processor; the various approaches for the software that can be implemented to power the communications processor; specific exemplary strategies for the software in the high performance communications processor; and a performance estimate for RFC 1483 bridging.

[0095] It is understood, however, that the invention is not limited to the specific embodiments and details, which are exemplary only. It is further understood that one possessing ordinary skill in the art, in light of known systems and methods, would appreciate the use of the invention for its intended purposes and benefits in any number of alternative embodiments, depending upon specific design and other needs.

[0096] A number of acronyms are used herein to describe various embodiments of the invention. A table of acronyms and definitions therefor is provided as Table 1 below:

TABLE 1
Acronym Definition
AAL     ATM Adaptation Layer
ABI     Application Binary Interface
ABR     Available Bit Rate
ADPCM   Adaptive Differential Pulse Code Modulation
ADSL    Asymmetric Digital Subscriber Line
ALU     Arithmetic Logic Unit
API     Application Programming Interface
ARC     ARC Cores
ARM     Advanced RISC Machines
ARP     Address Resolution Protocol
ASIC    Application Specific Integrated Circuit
ATIC    ATM Interconnect
ATM     Asynchronous Transfer Mode
ATMOS   ATM Operating System
BGP     Border Gateway Protocol (see FIG. 8)
B-ISDN  Broadband Integrated Services Digital Network
BLES    Broadband Local Exchange Server
BSC     Binary Synchronous Communications protocol (IBM)
BSP     Board Support Package
BTS     Base Transceiver Station
CAM     Content Addressable Memory
CBR     Constant Bit Rate
CCITT   Consultative Committee on International Telegraph and Telephone
CES     Circuit Emulation Services
CLEC    Competitive Local Exchange Carrier
CMTS    Cable Modem Transmission System
CPCS    Common Part Convergence Sublayer (ATM)
CPE     Customer Premises Equipment
CPP     Control Protocol Processor
CPU     Central Processor Unit
CRC     Cyclic Redundancy Code
CR-LDP  CR-Label Distribution Protocol
CS      Convergence Sublayer
CTL     Control
DDR     Dual Data Rate
DLC     Digital Loop Carrier
DMA     Direct Memory Access
DRR     Data Recovery Report
DS      Differentiated Services
DSL     Digital Subscriber Line
DSLAM   Digital Subscriber Line Access Multiplexer
DSP     Digital Signal Processor
EA      Effective Address
E-IAD   Enterprise Integrated Access Device
ENET    Ethernet
EPB     External Peripheral Bus
EPD     Early Packet Discard
EPROM   Erasable Programmable Read Only Memory
FIFO    First-In-First-Out
FPGA    Field Programmable Gate Array
FPU     Floating Point Units
FR      Frame Relay
FRF     Frame Relay Forum
FWD     Forwarding
GFR     Guaranteed Frame Rate
GPIO    General Purpose Input Output
HDLC    High-level data link control
HDSL    High-bit-rate DSL
H-MVIP  H Multi-Vendor Integration Protocol
HPCP    High Performance Communications Processor
HW      Hardware
IAD     Integrated Access Device
ID      Identification
I/f     Interface
IMA     Inverse Multiplexing over ATM
IP      Internet Protocol
IPoA    IP over ATM
IS      Integrated Services
ISOS    Integrated Software on Silicon
ISP     Internet Service Provider
ITU-T   International Telecommunication Union
IWF     Inter Working Function
LAN     Local Area Networks
LD      Load
LP      Low Priority
LPM     Longest Prefix Match
LSR     Label Switched Router
MAC     Media Access Control
MAN     Metropolitan Area Network
MDU     Multi Dwelling Unit
MEGACO  H.242 IEEE (voice protocol)
MFSU    Multi Function Serial Unit
MGCP    IETF standard (voice protocol)
MIB     Management Information Base
MII     Media Independent Interface
MIPS    MIPS Computer Systems, Inc.
MMU     Memory Management Unit
MPLS    Multi Protocol Label Switching
MSC     Mobile Switching Center
MTU     Multi Tenant Unit
MVIP    Communication backplane interface
NI      Network Interface
NP      Network Processor
OAM     Operation and Maintenance
OC      Optical Carrier
OEM     Original Equipment Manufacturer
OS      Operating System
OSE     A name of OS company
OSI     Open Systems Interconnection
OSPF    Open Shortest Path First
PBGA    Plastic Ball Grid Array
PBX     Private Branch Exchange
PCM     Pulse Code Modulation
PDU     Payload Data Unit
PHY     Physical layer
POS     Packet Over SONET
PP      Protocol Processor
PPD     Parallel Presence Detect
PPPoA   Point to Point Protocol Over ATM
PSOS    Portable Scalable Operating System
PSTN    Public Switched Telephone Network
QOS     Quality of Service
RAM     Random Access Memory
RED     Random Early Delete
RFC     Request for Comment
RIP     Routing Information Protocol
RISC    Reduced Instruction Set Computer
RMII    Reduced MII
RSVP    Resource Reservation Protocol
RTOS    Real-Time Operating System
RTP     Real Time Protocol
RX      Receive
SAR     Segmentation and Reassembly
SDRAM   Synchronous Dynamic RAM
SDSL    Symmetric DSL
SHDSL   Single-Line High-Bit Rate DSL
SIP     SMDS Interface Protocol
SMII    Serial Media Independent Interface
SMTP    Simple Mail Transfer Protocol
SNMP    Simple Network Management Protocol
SOC     System-On-A-Chip
SP      Strict Priority
SPI     Serial Protocol Interface
SPR     Special Purpose Register
SRAM    Static RAM
SSI     Synchronous Serial Interface
SSSAR   Service Specific SAR
ST-BUS  a TDM protocol
SW      Software
TCP     Transmission Control Protocol
TDM     Time Division Multiplexing
TM      Traffic Management
TOS     Type of Service
TTM     Time-to-Market
TX      Transmit
UART    Universal Asynchronous Receiver-Transmitter
UBR     Unspecified Bit Rate
UDP     User Datagram Protocol
UPnP    Universal Plug 'n Play
USB     Universal Serial Bus
VBR     Variable Bit Rate
rt-VBR  Real Time VBR
VC      Virtual Circuit
VCI     Virtual Channel Identifier
VCL     Virtual Channel Link
VoATM   Voice over ATM
VoIP    Voice over IP
VP      Virtual Path
VPI     Virtual Path Identifier
VLSI    Very Large Scale Integration
WAN     Wide Area Networks
WBS     Wireless Base Station
WFQ     Weighted Fair Queue

[0106] One inventive aspect of the present invention is to provide a rings architecture to build a system on a chip (SOC) and allow for ease in configuration, expandability and external interface. This rings architecture, in one embodiment, involves: (1) the use of transactions instead of signals; and (2) the use of a single switch fabric to carry the transactions instead of many connections as typically implemented in bus-based systems. A transaction, in at least one embodiment, includes an instruction generated by a certain module for directing, in a structured way, another module to perform some operation. Transactions are mapped onto a single physical connection. A transaction may direct a module to, for example, set a set-mode flipflop to one, clear register X, or add value Y to counter Z. Transactions also can be used to provide time sequencing. Furthermore, two transactions may be prevented from occurring at the same time, limiting the appearance of simultaneous errors (i.e., bugs). In one embodiment of the present invention, a rings-based system on a chip (SOC) is provided. The rings-based SOC comprises a plurality of ring members on a ring that communicate using point-to-point connectivity, a plurality of ring interfaces for interfacing the ring members with the ring, and a message traversing the ring, wherein the message travels one ring member per clock cycle. In this embodiment, the system is adapted so that upon the message arriving at a given ring member the message is processed by that ring member if the message is applicable to that ring member, and if the message is not applicable to that ring member, the message is passed on to the next ring member. Furthermore, subsequent ring members can be adapted to supply backpressure signals to prior ring members.

[0107] In one embodiment, the message is applicable to the given ring member based on at least one of an identifier identifying the given ring member and an identifier indicating that the message applies to multiple ring members. The identifier identifying the given ring member can comprise an address for the given ring member. Furthermore, the identifier indicating that the message applies to multiple ring members may, in one implementation, comprise message data designating the message as a supervisory message.

[0108] The message may comprise a type field, an address field, and a data field. The message may also comprise an enumeration message, wherein the enumeration message is processed by the ring members in order to assign address space consumed by each ring member. Additionally, a subsequent supervisory message can cause the results of the enumeration message to be returned, thereby allowing a central member comprising a CPU to infer the topology of the system. Alternatively, the message can comprise a reset message that is processed by the plurality of ring members in order to reset the system. Conversely, the message may comprise an activate message that is processed by the plurality of ring members in order to activate the system.

[0109] The message also may include a request from a CPU ring member that causes the other ring members to report out their address information. The message may also comprise a write message that is processed by one of the plurality of ring members to write data thereto, a read message that is processed by one of the plurality of ring members to read data therefrom, and/or a stray message indicator so that the system can identify stray messages.

[0110] In one embodiment, the ring members of the rings-based SOC comprise a CPU and a plurality of peripherals, wherein the peripherals are adapted to write ahead changes in peripheral status, thereby reducing the quantity of read messages that are issued by the CPU. The ring of the SOC also may include an external ring interface allowing the ring to communicate with modules that are not part of the ring.

[0111] In one embodiment, the rings-based SOC further comprises a land bridge that allows the message to proceed from one side of the ring to another side of the ring without traversing some of the intermediate ring members. The logic of the land bridge may be configured based on the results of an enumeration message.

[0112] Additionally, the plurality of ring members and plurality of ring interfaces of the rings-based SOC may comprise a first ring with the SOC further comprising a plurality of second ring members and a plurality of second ring interfaces defining a second ring, both the first ring and the second ring implemented as a system on a chip, and wherein the first ring and the second ring are coupled using a sea bridge. In one implementation, the logic of the sea bridge is configured based on the results of an enumeration message.

[0113] Referring now to FIG. 2, an exemplary ring network 30 is illustrated in accordance with at least one embodiment of the present invention. As illustrated, the exemplary ring network 30 includes two rings 32, 34 connected via a bridge 36, each ring including a plurality of modules 38-48. The modules can include any of a variety of modules implemented in SOCs for processing and/or handling data, such as a DMA, an external interface, a timer, a CPU, an I/O, a peripheral, and the like. In this case, the rings 32, 34 and the bridge 36 represent an implementation of the switch fabric 12 of FIG. 1 in accordance with at least one embodiment of the present invention. To summarize the operation of a ring of the ring network 30, consider the following exemplary operation of ring 32. In this example, messages are passed between modules counter-clockwise. When a module receives a message, the module determines if it is the intended recipient of the message. If the module is the recipient, the module removes the message from the ring and processes it accordingly. Otherwise, the module passes the message on to the next module (e.g., from module 44 to module 46) during the next clock cycle. If a module has a message to send, the module waits until there is a free slot and passes the message to the module's left hand neighbor. In this case, each message is one clock long and the messages travel around the ring 32, one hop per clock.
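The hop-by-hop behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the claimed hardware: the names Message, step and deliver are hypothetical, modules are identified by a list index rather than an address range, and "next neighbor" is modeled as index + 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Message:
    dest: int       # index of the recipient module (illustrative stand-in for an address)
    payload: str


def step(slots: List[Optional[Message]],
         inboxes: List[List[Message]]) -> List[Optional[Message]]:
    """Advance the ring by one clock.

    slots[i] is the message at module i's in-port, or None when idle.
    The recipient removes its message from the ring; every other module
    passes the message to its neighbor on the next clock.
    """
    n = len(slots)
    nxt: List[Optional[Message]] = [None] * n
    for i, msg in enumerate(slots):
        if msg is None:
            continue
        if msg.dest == i:
            inboxes[i].append(msg)       # consumed by the intended recipient
        else:
            nxt[(i + 1) % n] = msg       # one hop per clock
    return nxt


def deliver(n_modules: int, src: int, msg: Message) -> int:
    """Send msg from module src and return the number of hops until it is consumed."""
    if not 0 <= msg.dest < n_modules:
        raise ValueError("destination must be a module on this ring")
    slots: List[Optional[Message]] = [None] * n_modules
    inboxes: List[List[Message]] = [[] for _ in range(n_modules)]
    slots[(src + 1) % n_modules] = msg   # sender places the message in a free slot
    hops = 1
    while True:
        slots = step(slots, inboxes)
        if inboxes[msg.dest]:
            return hops
        hops += 1
```

In this model a message from module 0 to module 3 on a six-member ring is consumed after three hops, which reflects the predictable one-clock-per-member delay. Backpressure is deliberately left out of the sketch.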

[0114] Members of the Ring

[0115] Anchor—the host interface. Through this interface, the host resets, configures and controls the setup functions of the ring. The Anchor also can be adapted to determine if it is the primary Anchor.

[0116] Bridge (e.g., bridge 36)—a combination of two devices: an upstream link and a downstream link. During the setup stage, the bridge flips the network ID and acts as an Anchor for the upstream ring. The host, after the learning stage, programs the bridge about what switching to perform. The bridge snoops on the ring and, if a hit is detected, consumes the message and carries it to the other side. If the message is not a hit, then it is sent downstream as usual. The bridge typically has two address/mask registers per link direction.

[0117] Module—a collective name for components of a ring, such as a CPU, a bridge, a TDM interface, a Utopia interface, an xDSL PHY, a timer, a UART, an FCC, an MCC, a scratch RAM, a CRC calculator, and the like.

[0118] External Ring (ExtRing)—used to connect several chips to create a larger topology. An external ring is particularly useful in prototyping future peripherals by FPGA-extending existing ring-based silicon.

[0119] Packet Processor (also referred to herein as Vobla)—a network optimized CPU for managing communication logical links. The packet processor, in at least one embodiment, is used to control and terminate streams that are beyond the internal functionality of the device. The network side is handled through the rings; the other side includes, for example, an external RAM interface.

[0120] The rings architecture has many advantages over traditional bus designs and is an effective way to connect many different modules, whether on the same chip or on several chips. Instead of using signals and busses, communication between modules (data and commands) is mapped onto transactions, which in turn are transmitted over the ring infrastructure. Ring topology allows predictable delays and easy scalability. Each ring member adds a delay of, for example, one clock. The ring clock frequency can be made as fast as needed because of the geographical proximity of its members. Rings can be further connected through bridges, such as bridge 36. These bridges are similar to network switching devices in the sense that they are programmed to direct selected portions of the traffic to the other side (e.g., from ring 32 to ring 34). Inside a chip of one exemplary embodiment, the members of the ring are connected to one another using a standard [e.g., 8 bits type/20 bits address/32-64 bit data] connection. When going outside the standard, a smaller/slower interface may be defined.
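As a rough illustration of the exemplary 8-bit type/20-bit address/64-bit data connection, a transaction can be modeled as a single packed word. The field order (type in the most significant bits, data in the least significant bits) and the function names pack and unpack are assumptions made for this sketch; the text does not specify a bit layout.

```python
# Field widths taken from the exemplary connection; the ordering is assumed.
TYPE_BITS, ADDR_BITS, DATA_BITS = 8, 20, 64


def pack(msg_type: int, address: int, data: int) -> int:
    """Pack the three fields into one integer word (type | address | data)."""
    if not 0 <= msg_type < (1 << TYPE_BITS):
        raise ValueError("type does not fit in 8 bits")
    if not 0 <= address < (1 << ADDR_BITS):
        raise ValueError("address does not fit in 20 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data does not fit in 64 bits")
    return (msg_type << (ADDR_BITS + DATA_BITS)) | (address << DATA_BITS) | data


def unpack(word: int) -> tuple:
    """Recover (type, address, data) from a packed word."""
    data = word & ((1 << DATA_BITS) - 1)
    address = (word >> DATA_BITS) & ((1 << ADDR_BITS) - 1)
    msg_type = word >> (ADDR_BITS + DATA_BITS)
    return msg_type, address, data
```

Because every transaction has the same fixed shape, each ring member only needs to inspect the type and address fields to decide whether to consume the word or pass it on.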

[0121] In the broadest sense, the ring carries two kinds of messages: Setup/Config messages and Work (read and write) messages. The Setup messages can be used to learn the network topology, assign addresses and program the members (i.e., the elements of a ring). Setup messages are initiated by a host through a special anchor member. Regular members, in one embodiment, reply to setup messages by providing the host their functionality ID, ring ID and their starting address. The host software can infer from that data the exact topology of the network and the functionality of its members. Work messages, in one embodiment, are initiated by members based on their programming and functionality. On each clock a ring member examines its in-port. If the in-port has a valid message, then the member determines if the message is addressed to the member. If so, the member removes the message from the ring and processes the message accordingly. If not (i.e., the message is intended for another member), on the next clock the member transmits it downstream on the out-port when the out-port becomes available.

[0122] The following are examples of message types that may be used:

[0123] Idle—the connection is idle, i.e., no message; Reset—reset and propagate to reset the entire network;

[0124] Enumerate—propagate and obey the Enumeration algorithm (described below);

[0125] WhoAmI request—started by the anchor member and flooded unchanged throughout the ring network;

[0126] WhoAmI response—each member responds to a WhoAmI request by sending this message—the data field contains values of self-address and several other significant bits that enable the Anchor to learn the topology of the network;

[0127] Activate—includes the address of a specific ring member. When this message hits the member, a subset of the data bits is written into the RIF (ring interface) unit control register—the first bit is the activate bit (hence the name). After reset this bit is inactive. This prevents any work activity of the peripheral from taking place. Setting this bit to one enables normal work. Other bits include: scan_mode_enable, stop_clock, in_vivo_scan_test, ring_loopback_enable, (soft reset), as well as other user-defined bits (discussed below). These bits may be reset to zero;

[0128] Work write—sent during normal operation. These messages activate various peripherals, fifos (first-in-first-out), write into memory, etc.;

[0129] Work read—work messages are used to read from fifos, move blocks of SRAM (static RAM) data and communicate with DMAs, to name a few examples.

[0130] Exception—started by regular ring members, to propagate to the anchor (the assigned member that initiates the Enumeration process) and/or a PP (packet processor) to signify some condition needing attention;

[0131] Freeze—propagate message quickly through the network and disable all activity on the rings. Typically used for debug purposes where a fast freeze of the current state is needed.

[0132] Message Type Encoding

[0133] Table 2 sets forth a listing of message types with a proposed encoding structure and description of the encoding.

TABLE 2
message type          encoding   hex    description
idle                  00000XXX
supervisor requests   1111nnnn
                      11111000   0xF8   Enumerate.
                      11111001   0xF9   WhoAmI request.
                      11111010   0xFA   Activate
                      11111011
                      11111100   0xFC   freeze
                      11111101
                      11111110
supervisor responses  11110nnn
                      11110000   0xF0   WhoAmI response.
                      11110001   0xF1   error
work_read             01SWMLFI   0x40   S = enable snoop for the response of this message. W = width of the data message 64/32 for return. M = TBD. L = enable address modification to indicate last data of frame. F = enable address modification to indicate first data of frame. I = increment destination.
work_write            10SMLZZZ   0x80   S = Snoop this message. M = TBD. L = Last data transfer in the message. ZZZ = the number of valid bytes in the message (ZZZ=000 means 8 valid bytes in the message).
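A decoder for the proposed type byte might look like the following sketch. It assumes the leftmost character of each encoding in Table 2 is the most significant bit; the function names classify and decode_work_write and the returned labels are hypothetical, and codes that Table 2 leaves blank are reported as "unassigned".

```python
# Supervisor codes listed in Table 2; other values are treated as unassigned.
SUPERVISOR_CODES = {
    0xF8: "enumerate",
    0xF9: "whoami_request",
    0xFA: "activate",
    0xFC: "freeze",
    0xF0: "whoami_response",
    0xF1: "error",
}


def classify(type_byte: int) -> str:
    """Map an 8-bit message type to its Table 2 category."""
    if not 0 <= type_byte <= 0xFF:
        raise ValueError("type must fit in 8 bits")
    top2 = type_byte >> 6
    if top2 == 0b01:
        return "work_read"       # 01SWMLFI
    if top2 == 0b10:
        return "work_write"      # 10SMLZZZ
    if type_byte >> 3 == 0:
        return "idle"            # 00000XXX
    return SUPERVISOR_CODES.get(type_byte, "unassigned")


def decode_work_write(type_byte: int) -> dict:
    """Split a work_write type byte (10SMLZZZ) into its flag fields."""
    if classify(type_byte) != "work_write":
        raise ValueError("not a work_write type")
    zzz = type_byte & 0b111
    return {
        "snoop": bool(type_byte & 0x20),   # S
        "last": bool(type_byte & 0x08),    # L
        "valid_bytes": zzz or 8,           # ZZZ = 000 means 8 valid bytes
    }
```

For example, under these assumptions the byte 0xAB (binary 10101011) is a work_write with snoop set, last set, and three valid bytes.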

[0134] Ring Member Enumeration

[0135] While it is possible to pre-assign a hard addressing scheme for the members of a ring network, in at least one embodiment, the modules assign address space for themselves. As the modules are members of at least one ring, each module can take a block of address space and tell the next module its starting address (herein referred to as Enumeration). In many systems, this assignment often gives the same results, so it may not be necessary to actually reprogram the modules, but it reduces the need to change hardware registers every time the ring configuration is changed. This self-addressing also serves as a self-test. In a rings-based integrated circuit, such as a SOC communications processor, peripherals appear to a CPU as a starting address. Each offset from this starting address is assigned to a different command for the peripheral. Note that assigning different peripherals to different CPUs can simply be a matter of programming a location in RAM. Accordingly, several CPUs can be put on an IC without worrying about arbitration.

[0136] As discussed above, each member of the ring network has a predefined address space. In one embodiment, this is limited to some power of 2. For example, if a UART (Universal Asynchronous Receiver/Transmitter—used for serial communications and having a transmitter and a receiver) needs 5 registers, it allocates 8 addresses for itself. It also should first align the address to a border of 8.

[0137] The Enumeration process starts with the Anchor member, which sends on its outport an Enum message to begin the enumeration of ring members. As each member receives the Enum message, the member takes the address field and increments it to fit its own alignment. This becomes the zero offset address. Then the address is incremented to the next available block of the same alignment. This last address is sent downstream. Referring to FIG. 3, an exemplary enumeration process is illustrated in accordance with at least one embodiment of the present invention. In this example, assume that DMA 52 needs 16 addresses, UART 54 needs 4 addresses, and timer 56 needs 256 addresses. Further assume that the DMA 52 receives an Enum message having an address value=8. Accordingly, in this example, the DMA 52 would align itself to some power of two (16, in this example) and then claim the next 16 addresses (i.e., addresses 16-31). As a result, the next available address is address 32. Therefore, the DMA 52 would change the address value of the Enum message to address=32 and provide this value to the UART 54. Since address=32 is already aligned with a power of two, the UART 54, in this example, claims addresses 32-35 and assigns address=36 to the next available address of the Enum message. This Enum message is then provided to the timer 56. Since the timer 56 requires 256 addresses, the timer 56 aligns its starting address with a power of two greater than the next available address (e.g., 256) and claims the next 256 addresses. The next available address value of the Enum message is then changed to address=512 and provided to the next member of the ring.
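The arithmetic of this example can be checked with a short sketch. The helper names block_size, align_up and enumerate_ring are illustrative only; the sketch assumes each member rounds its need up to a power of two, aligns the incoming address to that size, and forwards the first address after its block. Bridges are not modeled.

```python
from typing import List, Tuple


def block_size(registers_needed: int) -> int:
    """Round a register count up to a power of two (e.g. 5 -> 8)."""
    size = 1
    while size < registers_needed:
        size *= 2
    return size


def align_up(address: int, size: int) -> int:
    """Return the smallest multiple of size that is >= address."""
    return (address + size - 1) // size * size


def enumerate_ring(start_address: int, needs: List[int]) -> Tuple[List[int], int]:
    """Walk the Enum address through the members in ring order.

    Returns the zero-offset (base) address claimed by each member and the
    next available address handed to whatever follows the last member.
    """
    bases = []
    address = start_address
    for need in needs:
        size = block_size(need)
        base = align_up(address, size)   # member's zero-offset address
        bases.append(base)
        address = base + size            # value sent downstream
    return bases, address
```

Calling enumerate_ring(8, [16, 4, 256]) reproduces the FIG. 3 walk-through: bases 16, 32 and 256, with 512 passed to the next member.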

[0138] This same enumeration process is repeated for each member of the ring network, except bridges, which are discussed in more detail below. In this case, bridges first allocate their own space and then send the in-port Enum message to the other side of the bridge. Furthermore, the bridge, in one embodiment, is adapted to flip the zero data. Accordingly, when the Enum message is returned to the bridge on the other side, the bridge passes it back on this side. As a first approximation, bridges can program the routing themselves. If there are no loops, each bridge may need a maximum of two ranges to look at. It is expected that no loops exist for the Enumeration protocol, so eventually the Enum message will get back to the Anchor. This signifies the end of the Enum process.

[0139] In accordance with one embodiment of the present invention, a communication system using a ring network architecture is provided. The system comprises a plurality of ring members connected in point-to-point fashion along the ring network, a transaction based connectivity for communicating a message among the ring members, and wherein the message is a configuration message that causes ring members to assign address space in the ring network. In one embodiment, the configuration message is processed by each ring member to cause that ring member to assign address space for that ring member, and wherein the configuration message is then passed to the next ring member.

[0140] In one embodiment, the configuration message includes an address that defines a starting address. The configuration message, in one implementation, is originated by an anchor member, which may include a CPU. In this case, each member processing the configuration message can revise the starting address before passing the configuration message to the next ring member. Furthermore, each member processing the configuration message can assign the address space of the member using the starting address and address space sufficient for that member.

[0141] In one embodiment, a CPU on the ring network of the system recognizes other ring members using starting addresses assigned to those ring members based on the configuration message. In this case, offsets to the starting addresses of the ring members may be used for different commands for the ring members.

[0142] Furthermore, in one embodiment, the ring network includes a bridge. In this case, the configuration message is processed by the bridge by assigning address space for the bridge and then passing the configuration message to the other side of the bridge. The configuration message can be processed by the bridge so that a subsequent message is routed according to whether an address associated with the subsequent message corresponds to one side of the bridge or the other side of the bridge. The subsequent message is passed across the bridge when the address is associated with the one side of the bridge, and wherein the subsequent message is passed through the bridge when the address is associated with the other side of the bridge. Additionally, the bridge, upon receiving a configuration message from one side of the ring network, responds by recording a first address included in the configuration message, passing the configuration message to the ring members on the other side of the ring network, and recording a second address included in the configuration message when the configuration message arrives from the other side of the ring network. In one embodiment, the first address corresponds to a near side of the bridge and the second address corresponds to a far side of the bridge.
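One way to read the two recorded addresses is as the bounds of the address range enumerated on the far ring. The sketch below follows that interpretation, which is an assumption rather than something the text states outright; the class name Bridge, its attribute names, and the "near"/"far"/"across"/"through" labels are all hypothetical. Only work messages are modeled.

```python
from dataclasses import dataclass


@dataclass
class Bridge:
    far_start: int = 0   # first address recorded: where the far ring's space begins
    far_end: int = 0     # second address recorded: first address after the far ring

    def on_enum_from_near(self, address: int, own_size: int) -> int:
        """Claim the bridge's own block, then hand the Enum address to the far ring.

        own_size is assumed to be a power of two, as for any other member.
        """
        base = (address + own_size - 1) // own_size * own_size
        self.far_start = base + own_size
        return self.far_start

    def on_enum_from_far(self, address: int) -> int:
        """The Enum message has returned; record the bound and continue on the near ring."""
        self.far_end = address
        return address

    def route(self, address: int, arrived_on: str) -> str:
        """Decide whether a work message crosses the bridge or stays on its ring."""
        in_far_range = self.far_start <= address < self.far_end
        if arrived_on == "near":
            return "across" if in_far_range else "through"
        return "through" if in_far_range else "across"
```

With this reading a bridge needs only a single range comparison per direction once enumeration has completed, which is in line with the small number of address ranges per bridge mentioned for the loop-free case.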

[0143] In one embodiment, the system further comprises a second configuration message which causes ring members to respond with descriptive data, wherein the descriptive data can include address space data for the ring members. Using this descriptive data, a CPU member on the ring network can be adapted to infer the topology of the ring network.

[0144] In accordance with yet another embodiment of the present invention, a method of assigning address space in a ring network architecture system including a plurality of ring members is provided. The method comprises issuing a configuration message, processing the configuration message at each ring member to assign address space for that ring member in the ring network, modifying the configuration message based on the assigned address space, and passing the configuration message to the next ring member. The configuration message is issued by an anchor on the ring network, wherein the anchor can include a CPU member.

[0145] In one embodiment, the configuration message includes a starting address and the address space is assigned based on the starting address and the address needs of that ring member. In this case, the method step of modifying comprises modifying the starting address before the step of passing.

[0146] Furthermore, in one embodiment, the plurality of ring members includes a bridge, wherein the bridge responds to the configuration message by configuring logic that provides for a subsequent message to be passed across or by the bridge depending on an address associated with the subsequent message. The ring network can be adapted to process a first category of message and a second category of message, and wherein the bridge logic is operative only for the second category. In one implementation, the first category is a supervisory message and the second category is a work message.

[0147] Activation Register

[0148] The activation register, in one embodiment, is part of every ring interface (RIF). It is sent as a reply to a Who_Am_I message. It concatenates several key parameters of each ring member. It can be used by the Anchor to learn the topology of the network. It can include the following fields: user_controls; module ID; user_ID; soft_reset; invivo; scan_mode; stop_clock activated; and the like. Module ID is a hardwired unique ID for each kind of member on the network. Ring ID is, for example, one bit used to identify where bridges are inserted. Each time the Enumerate message crosses a bridge, this bit is flipped. The Active bit is set/reset by activate (or activate all) message types to allow normal operation of the modules. While this bit is reset, the module should not operate.

[0149] Stages in the Operation of a Rings Network

[0150] Hardware connectivity—This is when the actual hardware is connected and the topology of the Rings is built. Several rings-compliant chips can be interconnected through the external ring interface. The unused interfaces can be shorted out.

[0151] Reset—the first message the Anchor typically propagates is a Reset message. It is flooded without clocking. The Host should wait a sufficient time for the reset message to flood the whole network.

[0152] Wake-Up—after power-up all modules sitting on Rings typically are in reset mode. All modules have all config bits reset.

[0153] Enumeration—the Host tells the Anchor to spread the Enumerate message, starting with some address (usually zero). Each Ring member receives the Enum message, computes its own address space needs and transmits downstream the next available address. The bridges first add their own space on the first ring, then transmit the message to the next ring. When the other side of the bridge consumes its own message, the closer side continues with the Enum message on the first ring.

[0154] Flood the WhoAmI request—the Host instructs the Anchor to flood the rings with a WhoAmI request message. All modules simply transmit it downstream, except bridges, which follow the Enumeration algorithm. Each ring member first sends its response and one clock later tries to relay the Request message. This is so the request message will hit the Anchor only after all responses have arrived. The Anchor can determine the end of the WhoAmI sequence by using this fact.

[0155] WhoAmI response—Each module, after getting a WhoAmI request, sends the contents of its Activation register as part of the WhoAmI response message. The Anchor should present all these messages to the host. It typically is the host's responsibility to infer the network topology from this data.

[0156] ProgramWr—After learning the network topology, via Who_Am_I response messages, the host can start configuring the members. Since it knows each member's starting address, the host can send requests to write to any register. The last stage is to activate the network by writing the active bit (for example, bit 1) in the zero offset register. If during later stages the Host needs to get the value of any register, it can do so by issuing a ProgramRd request and waiting for a ProgramRd response. Bridges are a special case for ProgramWr. Bridges need to be programmed first, before trying to pass data across them.

[0157] Activation—After the programming stage, the SOC is ready to perform processing and data handling tasks. To start all modules and enable them to work, the Activate message is flooded throughout the ring network.

[0158] Mode to kill stray messages—It is foreseeable that because of a bug in design or programming, a message could be sent that is not addressed to any member of the ring. Either its address is above the highest assigned address or it is addressed to empty space between consecutive members. If the address of the stray message is above the high limit, it can be routed to the Anchor and consumed or discarded by the Anchor. However, if the stray address is pointing to empty space, this message could circle the ring forever. A process used to prevent this endless loop follows: messages can have an additional bit running along with them. If a bridge is passing a message through (not across) it can set this bit on the message. If a message arrives at a bridge with this bit set, the bridge discards it. Care should be taken to ensure that only one bridge per ring (in case there are several) is operating in this mode. In rings where no bridge exists, the Anchor can perform this action. Messages freshly generated will have this bit zero. Also, every time a message crosses a bridge (from one ring to another) this bit is cleared. If a message circles the ring for a second time, the designated bridge will discard it.

[0159] For each ring, only one bridge should execute the above discard process. Otherwise legitimate messages could be discarded. The solution to this problem is as follows: during the Enumeration process, the bridge initializes its sides as a close side and a distant side. The close side is where the Enum message appears from. The distant side is the other side. In this case, the distant side can be selected to perform the monitoring of stray messages. On the primary ring (where the Anchor is located) the job of killing stray messages is done by the Anchor.
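The discard rule for stray messages can be sketched as follows. The names Msg, seen, monitor_pass_through, cross_bridge and hops_until_discarded are hypothetical; the sketch assumes exactly one designated monitor per ring (the distant side of a bridge, or the Anchor) and a message whose address no member claims.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Msg:
    address: int
    seen: bool = False   # the extra bit; freshly generated messages carry zero


def monitor_pass_through(msg: Msg) -> Optional[Msg]:
    """Designated monitor handling a message that passes through (not across).

    Returns the message to forward, or None if it is discarded as a stray.
    """
    if msg.seen:
        return None      # second time around the ring: discard
    msg.seen = True      # mark on the first pass
    return msg


def cross_bridge(msg: Msg) -> Msg:
    """Crossing from one ring to another clears the bit."""
    msg.seen = False
    return msg


def hops_until_discarded(ring_size: int, monitor: int, source: int, msg: Msg) -> int:
    """Circulate an unclaimed message and count hops until the monitor drops it."""
    position = source
    hops = 0
    while True:
        position = (position + 1) % ring_size   # one hop per clock
        hops += 1
        if position == monitor and monitor_pass_through(msg) is None:
            return hops
```

On a five-member ring with the monitor at position 0 and the sender at position 2, the stray message is marked after 3 hops and discarded one full lap later, after 8 hops. A legitimate message is consumed by its recipient before completing that second lap, which is why a single monitor per ring is sufficient.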

[0160] Rings Topology Issues

[0161] Clock alignment across a SOC often is a critical feature. Failing to achieve it will result in races, which are crippling or at least inefficient. While other undesirable clocking artifacts sometimes can be eliminated by lowering the frequency, cooling the chip, exposing it to light, etc., races typically are much more difficult to resolve. As FIG. 4 illustrates, if the delay between clk1 and clk2 is greater than the delay from the output of the first flip flop 60 to the input of the second flip flop 62, a race is likely, meaning that the second flip flop 62 could sample the data output from the first flip flop 60 a whole clock period early.

[0162] In a rings-based SOC in accordance with at least one embodiment, there typically is no need to align the clocks precisely across the whole chip. Clock alignment is needed only in singular chunks of data, herein referred to as compounds. Most of the compounds are small, such as peripherals. Others are of a medium size, such as DMAs. Some are considerably large, such as a packet processor. For larger compounds, some kind of clock alignment generally is mandatory. But the overall clocking problem can be divided into smaller, more easily solved problems. To illustrate, in at least one embodiment, signals going between any two modules are tightly controlled, because they are known in advance and there are only so many of them (for example, three signal groups: clock, data and backpressure). Furthermore, because of the topology, a solution in one section typically implies a solution for the whole system. Of particular importance is the direction along the ring any of the three groups takes, how the clock tree runs, and what special rules/checks/solutions are to be defined and enforced.

[0163] FIG. 5 illustrates a possible solution to the race problem. In this example, the clock signal path 64, in the same direction as the data path 66, is separated into a number of similar compounds (e.g., compounds 70, 72). By controlling the logic 74, 76 on each flip flop leaving a compound, it can be ensured that the delay between flip flops is at least long enough to prevent a race condition. This also can be verified after layout.

[0164] Although the solution illustrated in FIG. 5 may be implemented, in at least one embodiment, the clock signal is propagated in the opposite direction of the data, as illustrated with reference to FIG. 6. By providing the clock signal 78 in the opposite direction of the data signal 80, the potential for race between compounds 70, 72 is significantly reduced or eliminated.

[0165] In at least one embodiment, there is at least one signal that goes against the usual flow of data (signal 80), this signal being the OK signal 82, which is utilized to enable backpressure, as illustrated with reference to FIG. 7. The OK signal 82 generally needs special treatment because its sampling clock lags behind the sourcing clock (signal 78). However, this can be solved by ensuring that the return path is longer than the clock delay. Alternatively, as illustrated with reference to FIG. 8, a latch 86 may be implemented to ensure that data provided to flip flop 62 changes only after the rising edge of the clock 78 (clkb).

[0166] FIG. 9 illustrates a complication resulting from the propagation of the clock 90 in a direction opposing the propagation of data in a ring network having a bridge 94. As illustrated, data_a leaving the bridge 94 goes to member 96 and should be sampled by the rising edge of clkb. However, clkb lags considerably behind clka of the bridge 94. As demonstrated by the waveforms 98, race is imminent. However, by adding latches to the data lines, race can be eliminated or substantially reduced. Likewise, latches should be used on the OK signal to prevent race. It will be appreciated that the utility of the latches may be limited if the delay between, for example, clka and clkb is greater than about 75% of the cycle time, since substantial timing uncertainty may be introduced. FIG. 10 illustrates a complication resulting from the propagation of the clock 90 in the same direction as the propagation of data 102 in a ring network having a bridge 94. As illustrated, data_b leaves member 96 to be sampled by the bridge 94 using clk_a. As opposed to the situation referenced in FIG. 9, clkb now lags considerably behind clka. However, this may be advantageous if the lag is considerably smaller than the clock cycle, since the data can be delayed beyond the danger zone of clock delay. Likewise, the OK signal is covered and the last leg of data is covered. In this case, the only signal that typically must be considered is the OK signal from the bridge 94 to member 96. In this case, a latch can be used at member 96 to prevent race in the OK signal.

[0167] It is often desirable to minimize lag between members of a ring, thereby increasing the number of members supported by a single ring as well as minimizing the timing constraints to be considered. However, if one or more members are packet processors or other modules having considerable processing tasks, the clock entering such modules often is delayed considerably when the clock is regenerated to drive the big compound. In this case, the same principles apply and the problem may be solved using latches, as illustrated with reference to FIG. 11, which illustrates a data signal and clock signal propagating in the same direction. In this case, the local_clock 110 lags behind the ring_interface clock 112 of the module 114 (e.g., a packet processor). For outgoing data, this typically is not a problem since it changes later than the ring interface flip flops' clock. However, for data entering the module 114 from a previous member, race is a possibility. The same situation may occur in the event that the clock signal 112 and the data signal 116 propagate in opposite directions.

[0168] In accordance with one embodiment of the present invention, a rings-based system is provided. The system comprises a plurality of ring members on a ring network that communicate using point-to-point connectivity, a message traversing the ring from member to member, where the system is adapted so that upon the message arriving at a given ring member the message is processed by that ring member if the message is applicable to that ring member, and if the message is not applicable to that ring member, the message is passed on to the next ring member, and where the system further comprises a system clock signal for controlling timing on the ring network wherein the system clock signal is aligned between groups of ring members instead of among all of the ring members. In one embodiment, the system clock signal runs in the same direction as the message, while in another embodiment, the system clock signal runs in the opposing direction to the message. The alignment can be implemented to substantially remove skew among the clock signals. Furthermore, the alignment can prevent a flip-flop at a ring member from sampling data a clock cycle too early.

[0169] The system clock signal alignment preferably is performed among adjacent ring members, wherein the alignment for a ring member can be performed with respect to the ring member's upstream and downstream ring members. The alignment can be performed by inserting logic at the ring members that ensures that the delay between adjacent clock signals does not exceed the delay between the adjacent members. Similarly, the alignment can be performed using latches that are clocked by clock signals at individual members.

[0170] In one embodiment, the rings-based system further comprises a backpressure signal that runs in the opposing direction to the message, wherein the alignment is performed by inserting logic at the ring members to ensure that the return path for the backpressure signal exceeds the clock delay between adjacent members.

[0171] Bridges

[0172] As discussed previously, the ring topology in accordance with the present invention arranges modules in a logical ring. All data and control is transmitted over this ring infrastructure sequentially around the ring. However, as illustrated by FIG. 12, considerable ring latency may be introduced. To illustrate, if module 116 sends a message to module 118, there is little latency. However, if member 120 is to pass data to member 122, the data must pass through four modules (i.e., four clock cycles), resulting in considerably more latency. Another problem is peak latency. To illustrate, suppose that member 116 transmits mainly to member 122 and member 118 transmits data mainly to member 120. In this case, the communication between members 118 and 120 suffers degradation due to the traffic from member 116 to member 122.

[0173] In at least one embodiment, a bridge may be used to minimize the latency between members of a ring. As illustrated in FIG. 13, a bridge 130 may be used to connect two rings 132, 134. This bridge is analogous to a sea bridge since it connects two rings together just as a sea bridge connects two islands. The sea bridge, in one embodiment, determines what messages to cross over between rings and what messages to keep on the current ring. So, referring to the above latency problems, the sea bridge may be utilized to minimize peak latency issues. To illustrate, if member 134 communicates mainly with member 136, communications between member 138 and member 140 are not affected.

[0174] Intraring latency resulting from a relatively large number of members of the ring between the transmitting member and the intended recipient member may be reduced by a land bridge, as illustrated with reference to FIG. 14. The land bridge 146 is utilized within a ring 148 to minimize the number of hops for data/clock signals. To illustrate, without the land bridge 146, data from member 150 to member 152 would have to go through 5 members. However, the land bridge 146 reduces the number of members in the data path between member 150 and member 152 to 3 members (with two of the members being the bridge's two interfaces 154, 156).

[0175] The bridge, either a land bridge or a sea bridge, is adapted to analyze a message received at one of its interfaces and to pass the message through to its other interface or pass it on to the next member depending on the intended recipient of the message. For example, when member 150 sends a message to member 158, the land bridge 146 receives the message at bridge interface 154 and determines that the shortest path is to pass the message from the bridge interface 154 directly to the member 158. However, when member 150 sends a message to member 160, the land bridge 146 receives the message at bridge interface 154 and determines that the shortest path is to pass the message through the bridge to the bridge interface 156 and then from bridge interface 156 to the member 160. It is not necessary for a bridge to be aware of the topology of the ring when deciding the more optimal path for a message. Using the enumeration process, the bridge can obtain the information used to make this decision. Referring now to FIG. 15, an exemplary routing process by the bridge 146 is illustrated in accordance with one embodiment of the present invention. For enumeration purposes the land bridge 146 appears as two ring members (interface 154 being one member and interface 156 being the second). The member/interface of the bridge having the lower address (address=3 in this case) becomes the near end, and the member/interface of the bridge having the higher address (address=6 in this case) is marked as the far end. A message arriving at the near end (from the direction of the member 150) is passed on if the destination address of the message is greater than 3 and less than 6. Otherwise, the message is passed through the bridge 146 to the far end (interface 156). On the far end, a message arriving at the interface 156 from the direction of member 152 will be passed through to the near end (interface 154) if its destination address is less than 6 but greater than 3. Otherwise the message is passed on to member 160. In at least one embodiment, the address values by which a bridge 146 determines the routing of a message are determined during the enumeration process described herein. FIG. 16 illustrates a situation whereby two messages are received at an interface 154 of a bridge 146 at the same time. As illustrated, msg1 and msg2 are received at the same interface 154 at the same time. In one embodiment, messages transferred between interfaces of the bridge 146 are given priority, whereas in other embodiments, messages received at the bridge interface from members of the ring are given priority. Referring to FIG. 17, an exemplary implementation of a bridge 170 is illustrated. In this example, the bridge 170 includes control logic 172 adapted to control the upstream and downstream muxes 174-180 to pass the incoming messages through the appropriate fifo (fifos 182-188) between the downstream input and the upstream output, the upstream input and the upstream output, the downstream input and the downstream output, or the upstream input and the downstream output.
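The near-end/far-end routing rule described with reference to FIG. 15 can be sketched as follows. This is an illustrative sketch only: the function names and the string return values are assumptions, the default addresses 3 and 6 are taken from the example above, and messages addressed to the bridge interfaces themselves are not modeled.

```python
def route_at_near_end(dest: int, near: int = 3, far: int = 6) -> str:
    """Decision for a message arriving at the near end from the ring."""
    if near < dest < far:
        # Destination lies between the two bridge interfaces:
        # keep the message on the ring.
        return "pass_on"
    # Otherwise take the shortcut through the bridge to the far end.
    return "cross_bridge"


def route_at_far_end(dest: int, near: int = 3, far: int = 6) -> str:
    """Decision for a message arriving at the far end from the ring."""
    if near < dest < far:
        # Destination lies between the interfaces: pass the message
        # through the bridge to the near end.
        return "cross_bridge"
    # Otherwise continue to the next member beyond the far end.
    return "pass_on"
```

Under these assumptions, a message for address 4 arriving at the near end stays on the ring, while a message for address 8 crosses the bridge.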

[0176] In accordance with one embodiment of the present invention, a rings-based system on a chip is provided. This system comprises a plurality of ring members on a ring that communicate using point-to-point connectivity, a message traversing the ring from member to member, the system being adapted so that upon the message arriving at a given ring member the message is processed by that ring member if the message is applicable to that ring member, and if the message is not applicable to that ring member, the message is passed on to the next ring member, and wherein at least one of the ring members comprises a bridge.

[0177] In one embodiment, the bridge of the rings-based system is adapted to allow messages to travel from one side to another side of the bridge without passing through intermediate ring members. In this case, the bridge can be configured so that the message arriving at the bridge is routed according to whether an address associated with the message corresponds to one side of the bridge or the other side of the bridge.

[0178] Likewise, the message, in one embodiment, is passed across the bridge when the address is associated with the one side of the bridge, and wherein the message is passed through the bridge when the address is associated with the other side of the bridge. Accordingly, the bridge can include logic with a range of addresses, such that the message is routed to one side of the bridge or the other side of the bridge depending on whether the address is within the range. The logic may be established based on a configuration message that causes the ring members to assign their address spaces, and the configuration message may include an enumeration message.

[0179] In one embodiment, the plurality of ring members of the rings-based system are a first plurality of ring members comprising a first ring network and the system further comprises a second plurality of ring members comprising a second ring network, wherein the bridge comprises a bridge between the two ring networks. The bridge can be adapted to determine which messages to pass to the second ring network and which messages to keep on the first ring network. In this case, the bridge may be configured so that the message arriving at the bridge is routed according to whether an address associated with the message corresponds to one side of the bridge or the other side of the bridge. The bridge can include logic with a range of addresses, such that the message is routed to the first ring network or the second ring network depending on whether the address is within the range. This logic can be established based on a configuration message that causes the ring members to assign their address spaces. The configuration message, in this instance, may include an enumeration message. Furthermore, the message can be passed across the bridge when the address is associated with the first ring network, and wherein the message is passed through the bridge when the address is associated with the second ring network.

[0180] In another embodiment, the bridge is adapted to process a first category of message and a second category of message. The first category of message can include a supervisory message and the second category of message can include a work message. The bridge then can be adapted to make a routing determination based on the second category of message. In this case, the bridge can be adapted to identify the category of message by examining a message type included in the message.

[0181] Stray Messages

[0182] A stray message is a message addressed to an unused address of a ring network. The enumeration process typically leaves gaps of unused address space between active modules when the modules align themselves to starting addresses being, for example, a power of two. A stray message usually is a result of a software bug. Unchecked, stray messages may slowly choke the ring network, while such messages are difficult to detect and/or debug. However, not every member of the ring is required to know about, much less have the capability to detect or remove, stray messages. In one embodiment, this responsibility falls to the Anchor and/or bridges.

[0183] Referring now to FIGS. 18 and 19, a process for removing stray messages is illustrated in accordance with at least one embodiment of the present invention. In the illustrated embodiment, one bit of a message is used as a marker to determine if a message is a stray. The bit normally is set to zero, but when a message passes through an Anchor 192 or bridge 194, the bit is set to one. If the message arrives at the Anchor 192 or bridge 194 again, the Anchor/bridge notes the set bit and discards the stray message, thereby removing the stray from the ring.
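The marker-bit mechanism can be sketched as follows. This is a hedged illustration rather than the described hardware: the `Message` class and the `filter_at_anchor` function are hypothetical names, and the field name `passed_me` is borrowed from the `rif_d_passed_me` pin listed later in this description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    address: int
    passed_me: bool = False  # stray-marker bit, normally zero


def filter_at_anchor(msg: Message) -> Optional[Message]:
    """Apply the stray check at an Anchor (or at the filtering bridge side)."""
    if msg.passed_me:
        # The message has already been through this point once and was
        # never consumed: treat it as a stray and discard it.
        return None
    # First pass: set the marker bit and let the message continue.
    msg.passed_me = True
    return msg
```

A message that no member consumes returns to the Anchor with the bit set and is dropped on its second arrival.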

[0184] However, it will be appreciated that since a bridge has two ring interfaces, one of the interfaces must be selected to filter stray messages, particularly in land bridges. To illustrate, if member 196 sends a message to address=5 (an unassigned address), the land bridge 198 will receive the message at the far end 200 (address=11) and forward the message back to the near end 202 of the bridge 198 (address=3), where the process will be repeated unless the stray message is removed. Accordingly, in one embodiment, the far end 200 of the bridge 198 (i.e., the interface of the bridge furthest away from the anchor) is selected to filter for stray messages. The stray message marker bit of messages received at the near end 202 remains unchanged while the stray message marker bit is set at the far end 200 of the bridge.

[0185] FIGS. 20, 21, and 22 illustrate exemplary ring networks having more than one bridge per ring. To illustrate, FIG. 20 includes a ring having two parallel bridges 208, 210, FIG. 21 has a ring 212 with bridges 214, 216 that cross, and FIG. 22 includes a ring network having both a land bridge 222 and a sea bridge 224. Other bridge combinations may be utilized in accordance with the present invention.

[0186] Debugging and Testing on the Rings

[0187] Due to the topology of the ring network, there is an opportunity to use the infrastructure of rings to assist scan and debug. The rings can be used as a scan chain access to individual ring members, and a special in-vivo scan mode (discussed below) also may be employed. Referring to FIGS. 23 and 24, the insertion of a scan capability is illustrated. A scan may be enabled by introducing a new scan_insert member 230, which is not a regular member. The scan_insert member 230 can be adapted such that it does not introduce one clock delay. For ring signals it is a mux 232 between regular ring data and scan input signals. During test modes this mux 232 inserts scan input signals instead of regular ring data. During normal operation, this mux 232 connects the ring infrastructure as usual. In scan mode, the ring is effectively cut off. Insert-scan signals come directly from input pads 234, 236 on the chip. The tapped results drive the output pads. The insert scan signals form three major groups: Message type, Message address and Message data.

[0188] Before the actual scan can commence, the ring should be programmed to scan mode. This can be done by forcing a sequence of supervisor messages onto the ring. This sequence first resets the ring, then Enumerates it. The last stage is activating one specific member for scan. After the scan mode is programmed to the member, the actual scan can be done. The scan mux signal is part of the ring. It is programmed via, for example, the external pad to create the shift-in sequence. Then for one clock it is negated. During this cycle the scan capture occurs. Then the scan mux is asserted again and clocking advances the scan out data. The scan out data is tapped off the wires entering the scan_insert module. Referring to FIG. 25, exemplary signals 240-250 used as scan chains are illustrated. During scan, several message data signals are used as scan chains. The number of data lines depends on how many parallel scan chains are necessary.

[0189] In-vivo Scan

[0190] A typical silicon debug scenario is as follows: a chip is run for one billion clocks and a bug is discovered. The test is rerun for half the clocks and then stopped. From the values of all flip-flops at the stopped state, the source of the problem or error is hopefully determined. In such a scenario, in-vivo scan may be utilized. For an in-vivo scan, the chip is started as usual. The software is run for the specified number of clocks (note: optionally, a special counter may be used to freeze the rings). The ring modules are then deactivated by, for example, a message from a certain module. One specified ring module is re-activated in in-vivo scan mode. This mode causes the module to run shift-out of all its flip-flops. The module's ring interface is responsible for managing the scan-out. It counts blocks of, for example, 32 scan-out bits, packages them in one message and ships the message to the Anchor. The Anchor or other module needs to retrieve these messages out of the Anchor and pass them to debug software. The message type typically is the Program Read Response message, which is designed to get to the Anchor. The address is the module's self-address. The data of this message is, for example, 32 bits of scan-out data. Each activation of this mode causes a certain number of such messages to be generated. If the module has more flip-flops than the total bit count of the messages, the designated module can do this activation again and again. To facilitate fast freeze of the members' state, a special supervisor message (Freeze message) is defined to run quickly around the rings and freeze the state of each module.
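The packaging of scan-out bits into messages can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the tuple layout of a message, the string tag standing in for the Program Read Response type, the most-significant-bit-first ordering, and the zero padding of a final partial block are all hypothetical choices not specified above.

```python
from typing import Iterable, List, Tuple

# Hypothetical message representation: (type, address, data).
ScanMessage = Tuple[str, int, int]


def package_scan_out(bits: Iterable[int], self_address: int,
                     block_bits: int = 32) -> List[ScanMessage]:
    """Group shifted-out flip-flop bits into blocks, one message per block."""
    messages: List[ScanMessage] = []
    word, count = 0, 0
    for bit in bits:
        # Assumption: the first bit shifted out becomes the most
        # significant bit of the message data.
        word = (word << 1) | (bit & 1)
        count += 1
        if count == block_bits:
            messages.append(("PROGRAM_READ_RESPONSE", self_address, word))
            word, count = 0, 0
    if count:
        # Assumption: a final partial block is padded with zeros.
        word <<= block_bits - count
        messages.append(("PROGRAM_READ_RESPONSE", self_address, word))
    return messages
```

Each message carries the module's self-address, so that debug software reading the messages from the Anchor can attribute the data to the scanned module.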

[0191] In accordance with one embodiment of the present invention, a rings-based system on a chip is provided. The rings-based system comprises a plurality of ring members on a ring network that communicate using point-to-point connectivity, a message traversing the ring from member to member, where the system is adapted so that, during normal operation, upon the message arriving at a given ring member the message is processed by that ring member if the message is applicable to that ring member, and if the message is not applicable to that ring member, the message is passed on to the next ring member, and wherein the system is further adapted for a scan testing mode in which one of the ring members is enabled for a scan output and the other ring members are deactivated. The deactivated members can be adapted to pass messages without consuming the messages.

[0192] The scan output can be packaged into one or more messages that are transmitted by the one ring member. The one or more messages may be transmitted to a processor, wherein the processor can include a ring member operating as a supervisor that consumes supervisory response messages. In this case, the processor can be adapted to make the data from the one or more messages available to debugging software. Additionally, in one embodiment, a second of the ring members of the rings-based system comprises a processor that issues at least one message that operates to deactivate the other ring members and to enable the one ring member for the scan output.

[0193] In one embodiment, the operation of the system in the scan testing mode causes the one ring member to shift out flip-flops associated with the one ring member into one or more messages sent on the ring. The scan testing mode can be initiated by resetting the ring network and enabling the one member for the scan mode, where initiation of the scan testing mode may include enumerating the ring network. In one embodiment, the scan testing mode allows a user of the system to debug the system without adding additional hardware.

[0194] Furthermore, in one embodiment, the plurality of ring members are coupled to the ring network using a plurality of ring interfaces having registers, wherein the registers preferably include bits that can be set to deactivate the ring member associated with that ring interface. The registers also may include bits that can be set to enable the ring member associated with that ring interface for the scan output.

[0195] In accordance with another embodiment of the present invention, a method of scanning in a ring network having a plurality of ring members is provided. The method comprises observing a defect or anomaly during normal operation of the ring network, issuing at least one message that causes one ring member to enter a scan output mode and other ring members to be deactivated, resuming operation of the ring network, and outputting scan data from the one ring member onto the ring network as messages. The method, in one embodiment, further comprises causing a different ring member to enter the scan output mode in order to isolate the defect or anomaly. The at least one message can comprise at least one supervisory message that configures bits in ring interfaces associated with the ring members. Additionally, in one embodiment, the step of observing takes place at a point in time during the normal operation, and wherein the step of resuming is carried out just prior to the point in time.

[0196] During the scan output mode, in one embodiment, the one ring member packages its scan output as messages to be transmitted to a processor ring member. In this case, the processor ring member can be adapted to make the scan output available to debugging software.

[0197] Basic Ring Interface (RIF) Overview

[0198] This section covers three issues: the basic ring timing and backpressure protocol; the ring interface unit block diagram; and the interface to the user module connected to the ring, which the block diagram is used to describe. Regular ring members need not be aware of the ring intricacies. The basic ring interface is intended to hide most of the timings and protocols. FIGS. 26, 27 and 28 illustrate an exemplary implementation of ring signaling between modules of a ring network. As discussed previously, in one embodiment, the OK signal 266 (back pressure) flows in a reverse direction to inform member 268 that on the next rising clock 272 it may force a new message on type/addr/data lines 274-278. The OK signal 266 is generated by the receiving member 270. By default, in one embodiment, the OK signal 266 is active and the only time it goes down is when the message type is non-idle and there is no room in the correct fifo of member 270. The correct fifo is either fifo 280 for through traffic in member 270 or the fifo for messages addressed to member 270. Thus the OK signal 266 is generated by signals coming from member 268 to member 270 and is sent roundtrip back during the same clock.

[0199] The generation of the OK signal 266 can be done from flip-flops resident in member 270 and the type lines of the message coming from member 268. For example, if the fifo 280 is full, the OK signal 266 is negated, even though the next OK down the ring is active and is freeing an entry in the fifo 280. The same basic OK protocol is used four times in each RIF (ring interface) unit (FIG. 27). The same OK protocol is valid for the four exemplary RIF interfaces.
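The OK rule stated above (active by default, negated only when the incoming message type is non-idle and the pertinent fifo has no room) can be written as a small combinational function. This is a sketch under assumptions: the function and parameter names are hypothetical, and the decision of whether a message is addressed to the receiving member is taken as an input rather than derived from the type and address lines.

```python
def ok_signal(type_is_idle: bool, addressed_to_me: bool,
              imessage_fifo_full: bool, through_fifo_full: bool) -> bool:
    """Backpressure (OK) driven by the receiving member to the sender."""
    if type_is_idle:
        # Nothing is being offered, so OK stays at its default active level.
        return True
    # Select the fifo that pertains to this message: the fifo for messages
    # addressed to this member, or the through-traffic fifo.
    fifo_full = imessage_fifo_full if addressed_to_me else through_fifo_full
    # As described above, a full fifo negates OK even if an entry is
    # being freed downstream during the same clock.
    return not fifo_full
```

For instance, a through message offered while the through fifo is full yields a negated OK, so the sender holds the message for another clock.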

[0200] In accordance with one embodiment of the present invention, a rings-based system on a chip is provided. The rings-based system comprises a plurality of ring members on a ring network that communicate using point-to-point connectivity, a message traversing the ring from member to member, where the system is adapted so that upon the message arriving at a given ring member the message is processed by that ring member if the message is applicable to that ring member, and if the message is not applicable to that ring member, the message is passed on to the next ring member, and the system is further adapted so that downstream adjacent ring members provide a signal to their upstream adjacent ring members that indicates whether a slot is available for the upstream ring member to pass the message to the downstream ring member on a given clock cycle. The receipt of the signal indicating that a slot is not available, in one embodiment, causes the upstream ring member not to pass the message on that clock cycle. In one embodiment, each ring member provides the signal to the immediately prior ring member each clock cycle.

[0201] In one embodiment, each ring member couples to the ring network by a ring interface, where the signals regarding slot availability are passed between adjacent ring interfaces. In this case, the ring interface can include an input FIFO and a through FIFO. The signal can be generated by the downstream ring member and passed to an immediately upstream ring member holding the message, where the signal is generated according to the FIFO for the downstream ring member that pertains to the message. In this case, the downstream ring member can be adapted to determine that the input FIFO pertains to the message if the message is to be consumed by the downstream ring member and that the through FIFO pertains to the message if the message is not to be consumed by the downstream ring member. The determination can be made by the downstream ring member examining information descriptive of the message before the message in its entirety is sent from the upstream ring member to the downstream ring member, where the information preferably comprises data from a type field and an address field for the message. The signal can indicate that a slot is available when the input FIFO pertains to the message and the input FIFO can accept a message and/or when the through FIFO pertains to the message and the through FIFO can accept a message.

[0202] In one embodiment, the signal generated by the downstream adjacent ring members is a backpressure signal that is generated based on data sent from the upstream ring member to the downstream ring member and then back to the upstream ring member in a round trip fashion during a single clock cycle. Furthermore, in one embodiment, each ring member has a ring interface, wherein each ring interface has four interfaces using or providing the signal which comprises a backpressure signal.

[0203] In accordance with another embodiment of the present invention, a method of controlling the transmission of messages on a ring network comprising a plurality of ring members is provided. The method comprises providing a message at a first upstream ring member that is available for output to a second adjacent downstream ring member, receiving a signal at the upstream ring member from the downstream ring member that indicates whether a slot is available for outputting the message on a clock cycle, and outputting the message from the upstream ring member to the downstream ring member if a slot is available and holding the message if a slot is not available.

[0204] In one embodiment, the signal is generated based on the content of the message. In this case, the signal can be generated based on whether the message will be consumed by the downstream ring member or pass through to a further downstream ring member. The content of the message preferably includes at least a portion of the message type and/or at least a portion of the message address.

[0205] Furthermore, in one embodiment, the downstream ring member is coupled to an input FIFO and a through FIFO, wherein the downstream ring member determines which FIFO pertains to the message. The downstream ring member also can determine whether the pertinent FIFO is capable of accepting the message.

[0206] The Imessage path carries the messages intended for this member. Each message bus on the diagram above is actually a collection of three fields: type/8, addr/20, data/64. This is true for 3 out of 4 interfaces. For the Imessage path, the type can in most cases be reduced to work/program and read/write. Also, several other bits of type might be needed, like last and size. For the address field only the low order bits are needed. The address bits needed are the bits that cover the internal module address space. The data field might be reduced in some cases to 32 bits or even less, for example for an 8 bit UART. The Imessage fifo may be a very reduced version of the other fifos.
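The three-field message bus (type/8, addr/20, data/64) can be illustrated by packing the fields into a single 92-bit value. This is a sketch only: the function names are hypothetical, and the ordering of the fields within the packed value (type in the high bits, data in the low bits) is an assumption, since no bit ordering is specified here.

```python
TYPE_BITS, ADDR_BITS, DATA_BITS = 8, 20, 64


def pack_message(msg_type: int, addr: int, data: int) -> int:
    """Pack the three fields into one integer (assumed order: type|addr|data)."""
    for value, width in ((msg_type, TYPE_BITS), (addr, ADDR_BITS),
                         (data, DATA_BITS)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value:#x} does not fit in {width} bits")
    return (msg_type << (ADDR_BITS + DATA_BITS)) | (addr << DATA_BITS) | data


def unpack_message(word: int) -> tuple:
    """Inverse of pack_message: recover (type, addr, data)."""
    data = word & ((1 << DATA_BITS) - 1)
    addr = (word >> DATA_BITS) & ((1 << ADDR_BITS) - 1)
    msg_type = word >> (ADDR_BITS + DATA_BITS)
    return msg_type, addr, data
```

A reduced Imessage path would simply carry fewer of these bits, for example only the low order address bits covering the internal module address space.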

[0207] The Omessage fifo 282 transmits messages originating locally to the outside ring. It has to support full fields, because many kinds of messages can be produced. As can be seen from FIG. 28, the OK signal logic 284 originates in the sending member 268. It starts with creating the message type and address. The type and address fields travel to member 270, where, using these two fields, a decision is made as to whether the message is a through message or ends at and is consumed by member 270. In each case, the status of the corresponding fifo is transmitted back as the OK signal. The next rising clock samples this OK to mux either the previous message, a new one, or idle. As presented, all four interfaces of the RIF have similar turnarounds with their OK signals.

[0208] Routing of Incoming Messages

[0209] Referring now to FIG. 29, an exemplary process for routing of incoming messages is illustrated in accordance with at least one embodiment of the present invention. As illustrated, incoming messages to a module are examined first to determine if the message is a supervisor or work/program message. Using the address field 290, the intended address of the message can be determined. Since, in one embodiment, the address of the module is aligned to a power of two, an address mask 292 (referred to as the split mask) may be used to compare only a subset of the bits of the address. The lower part 294 of the address is passed into the module as an internal address. The subset of bits is compared against a self-address register 296 containing the addresses associated with the module (obtained during the enumeration process). If the subset 294 matches the self-address register 296, the module can consider the message to be addressed to the module. The resulting ours/through indication is used to create the correct DOK (down ok) signal. The above discussion ignores the supervisor messages. Some supervisor messages make different use of the address field when they apply to all members (Enumerate). Some of the supervisor messages are responses from members. These messages carry the address of the sender.
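The split-mask comparison can be sketched as follows for work/program messages. This is an illustrative sketch: the function name and parameters are hypothetical, the 20-bit address width is taken from the addr/20 field described earlier, and supervisor messages are not modeled.

```python
ADDR_BITS = 20  # width of the message address field


def decode_address(addr: int, self_address: int, internal_bits: int):
    """Return (is_ours, internal_address) for an incoming message.

    internal_bits is the number of low order address bits that cover the
    module's internal address space; self_address is assumed to be aligned
    to 2**internal_bits.
    """
    low_mask = (1 << internal_bits) - 1
    # The split mask selects only the upper address bits for comparison.
    split_mask = ((1 << ADDR_BITS) - 1) & ~low_mask
    is_ours = (addr & split_mask) == (self_address & split_mask)
    # The lower part of the address is passed into the module.
    return is_ours, addr & low_mask
```

With a self address of 0x00100 and 8 internal address bits, address 0x00123 is treated as addressed to the module with internal address 0x23, while address 0x00223 is treated as through traffic.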

[0210] Referring now to FIGS. 30-33, exemplary implementations of the RIF 300 are illustrated in greater detail.

[0211] The main RIF registers include:

[0212] self_address_valid bit flipflop: indication that Enumeration was run and an address assigned;

[0213] self_address: value of self address. This register typically is 20 bits although fewer bits may be used, as the lower part of this register typically is zero;

[0214] idnumber: a constant parameter used to identify the associated member;

[0215] ADDRESS_SPACE: this is the number of bits used by the internal address space. It is used to calculate the address space claimed by the ring member.

[0216] activated bit: This bit is reset at hardware reset and modified further by activate messages. If this bit is active, the ring interface is in work mode and will process work messages. If this bit is inactive, the ring member should wait for programming or activation. scan_enabled bit in the activation register: turns the module into scan mode. Reset by hardware reset, further modifiable by activation messages.

[0217] in_vivo scan and related: scan out of all registers during interruption of normal work. This is done on a per-module basis.

[0218] RIF Signal Descriptions

[0219] By convention, the term input refers to a signal entering a ring interface and output refers to a signal driven by the ring interface.

[0220] The pins to a subsequent ring member/from a previous ring member include:

[0221] rif_d_type[7:0]: input, message type

[0222] rif_d_addr[19:0]: input, message address

[0223] rif_d_data[63:0]: input, message data

[0224] rif_d_ok: output, backpressure, goes back to previous member

[0225] rif_d_clock: input, clock in signal

[0226] rif_d_scan: scan mode enable (the actual muxing signal, not test mode)

[0227] rif_d_reset: input, h/w reset

[0228] rif_d_passed_me: input, indicates that message passed through bridge or Anchor already

[0229] Pins for messages entering the ring member include:

[0230] rif_i_write: output, this message is a valid write and can come from a program or work write. The RIF module modifies the options bits (see below) in case of a program write.

[0231] rif_i_read: output, this message is valid read.

[0232] rif_i_options[5:0]: output, rest of the bits of type in the message. These bits are relevant to more sophisticated members, snooping on last and such. For simple members they do not have to be used. Option bits have one of two possible interpretations, one for read and one for write. For write: snoop, last and size. For read: enable snoop, width of the response (64 bit or 32 bit, for example), enable last address modification (end of frame indication), enable first address modification (start of frame) and increment destination. These are discussed above with reference to message type encoding.

[0233] rif_i_addr[15:0]: output, relevant part of address

[0234] rif_i_datal[31:0]: output, relevant part of data low

[0235] rif_i_datah[31:0]: output, relevant part of data high

[0236] rif_i_ok: input, tells the RIF that the message is accepted by the member. On the next clock, a new message may be sent.

[0237] Control pins entering the RIF include:

[0238] rif_activated: output, reflects the activated bit in the activation register. If not enabled, this bit prevents work messages from entering/exiting the member. Also, peripherals should not start transmit/receive operations with this bit disabled.

[0239] rif_reset: output, either hard reset or soft reset;

[0240] rif_scan_mode: output, reflects the scan bit in the activation register; if enabled, this member is under scan test;

[0241] rif_scan: output, scan muxing signal; if enabled, in shift of scan operation; if disabled with mode, means capture;

[0242] rif_self_address[19:0]: output, self address;

[0243] rif_clock: clock for local flipflops;

[0244] rif_user_id[1:0]: user defined modifier of module ID input;

[0245] rif_user_control[3:0]: bits from the activation register for user definition and use;

[0246] Pins for messages going to the next member of the ring include:

[0247] rif_u_type[7:0]: output;

[0248] rif_u_addr[19:0]: output;

[0249] rif_u_datal[31:0]: output, data low;

[0250] rif_u_datah[31:0]: output, data high;

[0251] rif_u_ok: input, back pressure from next member;

[0252] rif_u_clock: output, clock out signal;

[0253] rif_u_scan: output, scan mode enable (the actual muxing signal, not test mode); rif_u_reset: output, hardware reset;

[0254] rif_u_passed_me: output, indicates that message passed through bridge or Anchor already;

[0255] Pins for messages exiting the member include:

[0256] rif_o_type[7:0]: input, message type bits; (type[7:3] != 0) acts as the valid indication;

[0257] rif_o_addr[19:0]: input, message address;

[0258] rif_o_datal[31:0]: input, message data low half;

[0259] rif_o_datah[31:0]: input, message data high half;

[0260] rif_o_replace: input, request to replace the relevant part of datal with self address bits;

[0261] rif_o_ok: output, tells the member that the message is accepted by the RIF;

[0262] Anchor RIF Interface

[0263] The Anchor RIF interface, in one embodiment, is a variation on the RIF interface used by regular ring members. It has one more state variable—active/passive Anchor. If the Enumerate message comes through the dmessage inputs, then an Anchor declares itself passive. If the Enumerate message comes from the omessage input, then the Anchor declares itself an active Anchor. An active Anchor consumes all supervisor messages, whereas in regular RIFs, supervisor messages are ignored by passing them all to the imessage output. For work messages there is another difference. Anchors have self-address space like any other ring member. Work messages addressed to the Anchor address space are consumed. Anchors also participate in stray message kills (as discussed above). If a message is addressed above (or below) the Enumerated address space, it will be caught and discarded by the Anchor.

[0264] Bridge RIF

[0265] A primary function of the Bridge is to direct traffic between rings. During Enumeration, the Bridge learns all it has to know about the topology. The signal interfaces of a bridge are identical to two sets of regular RIF interfaces. The only exception is the clock, which has a tree topology. Other tug-along signals, like scan, take the longest (crossover) route. From a hardware point of view, a bridge can be viewed as two RIFs connected back to back. However, the bridge provides additional functionality. For one, the bridge records the first input to receive the Enumeration message. The end that receives the Enumeration message first is labeled near, because it is closer to the Anchor. The other end is labeled far. The incoming Enumeration address is also recorded as the low range. The Enumeration message is then sent out on the far side. When it returns on the far side dmessage input, the address is recorded again as the high address. At this point the bridge is ready to work.
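
The learning sequence a bridge goes through during Enumeration can be sketched as a small state holder. This is an illustrative model only: the class name `BridgeEnumState`, the side labels 'A' and 'B', and the returned action strings are hypothetical, and what the bridge does with the Enumeration message after recording the high address is not modeled.

```python
class BridgeEnumState:
    """Hypothetical model of what a bridge records during Enumeration."""

    def __init__(self):
        self.near = None   # end that received the Enumeration message first
        self.low = None    # address carried by the message on first arrival
        self.high = None   # address carried when it returns on the far side

    @property
    def ready(self):
        return self.high is not None

    def on_enumerate(self, side, address):
        """side is 'A' or 'B': the end whose dmessage input saw the message."""
        if self.near is None:
            # First arrival: this end is nearer to the Anchor.
            self.near = side
            self.low = address
            return "forward_to_far"
        far = "B" if self.near == "A" else "A"
        if side == far and self.high is None:
            # The message has gone around the far ring and come back.
            self.high = address
            return "ready"
        return "ignored"
```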

[0266] During normal operation, Supervisor request messages, in one embodiment, are crossed to the other side. Supervisor response messages are moved to the near umessage output. Program write messages and Program read requests are treated as work messages. Program read responses are moved to the near umessage output. Work messages are routed based on the low/high bounds. If the message address is between the low/high bounds, it is moved to the far umessage output. Otherwise the near side gets it. The far side also participates in detecting and removing stray messages.
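
The routing rules in this paragraph can be summarized in a short sketch. The function name `bridge_route`, the message-kind strings, and the `arrived_on` parameter are hypothetical. The sketch also assumes that the high bound is exclusive (that is, the far range is low <= address < high); the text does not say whether the bounds are inclusive.

```python
def bridge_route(kind, addr, arrived_on, low, high):
    """Pick the umessage output ('near' or 'far') for a message at a bridge.

    kind: one of 'supervisor_request', 'supervisor_response',
          'program_read_response', 'program_write', 'program_read_request',
          'work' (hypothetical labels).
    arrived_on: 'near' or 'far', the side whose input received the message.
    """
    if kind == "supervisor_request":
        # Supervisor requests are crossed to the other side.
        return "far" if arrived_on == "near" else "near"
    if kind in ("supervisor_response", "program_read_response"):
        # Responses head back toward the Anchor on the near side.
        return "near"
    # Program writes, program read requests, and work messages are routed
    # by address. Assumption: half-open range [low, high).
    return "far" if low <= addr < high else "near"
```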

[0267] In one embodiment, messages appear to the member module through the rif_i_ signals.

[0268] These signals include:

[0269] rif_i_write: changes just after the rising edge of the clock. If active, it means a valid write message arrived. Valid means correct type and context; the user does not have to worry about decoding message types and such;

[0270] rif_i_read: changes at the same time; means a valid read message arrived;

[0271] rif_i_options[5:0]: bits extracted from the type part of the message. For read they mean snoop, width, last, first and increment, and for write they mean last, snoop and size bits;

[0272] rif_i_ok: the member generates a positive acknowledge to the ring interface. This signal should be valid (or negated) shortly after rif_i_read or rif_i_write become valid. If OK is negated during this cycle, the same message data will be driven on the next cycle. Members should make every effort to keep this signal asserted as much as possible;

[0273] rif_i_addr[19:0], rif_i_datal[31:0] and rif_i_datah[31:0]
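
The hold behavior of rif_i_ok described in this list can be illustrated with a cycle-by-cycle sketch. This is a simplified software model, not the hardware: the function name `drive_messages` is hypothetical, one list element stands for one clock cycle, and an idle cycle is represented by `None`.

```python
def drive_messages(messages, ok_per_cycle):
    """Return what the ring interface presents to the member in each cycle.

    messages: queue of messages waiting to enter the member.
    ok_per_cycle: the member's rif_i_ok value in each cycle (True = accepted).
    A message that is not accepted is driven again on the next cycle.
    """
    presented = []
    index = 0
    for ok in ok_per_cycle:
        if index >= len(messages):
            presented.append(None)  # nothing left to deliver: idle
            continue
        presented.append(messages[index])
        if ok:
            index += 1  # accepted, so a new message may follow next cycle
    return presented
```

With two queued messages and rif_i_ok negated in the second cycle, the second message is presented twice before the interface goes idle.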

[0274] General controls entering a RIF include:

[0275] rif_clock: clock;

[0276] rif_reset: reset;

[0277] rif_activated: the member received ok to operate. This signal is useful for Rx peripherals, so that they do not start generating traffic without activation;

[0278] rif_self_address[19:0]: self address on the ring;

[0279] Constant controls exiting a member and entering ring_control include:

[0280] module_id[7:0] these two bits can be used by members to tell the system something specific about themselves. For example Ethernet MACs can use one of these signals to tell the world if they are 10 or 100 Mbit connected;

[0281] rif_o_type[7:0] is the type of outgoing message;

[0282] rif_o_addr and rif_o_datal/datah are the rest of the message bits;

[0283] rif_o_ok: if in the current cycle this signal is inactive (low), don't change the message on the next positive edge.

[0284] Ring_control parameters include:

[0285] ring_interface_unit (also called ring_control) has 2 parameters, which should be set at verilog instance time. ADDRESS_SPACE: this number signifies the number of internal address lines that should enter the member. For example, if a member has an internal memory map of 256 bytes, it needs 8 address lines to address this space, and its ADDRESS_SPACE should be set to 8. It also means that, to recognize a message to this member, the 12 most significant bits of the message address are used. MODULE_ID: each hardware ring member gets, for example, 8 bits for a unique ID. This ID is unique to the hardware type and shared by all its instances; for example, all Ethernet MACs have the same ID. To distinguish between different MACs, self_address and user_id bits can be used. The Module ID can be examined by the Anchor using Who_Am_I messages. The Module ID typically is part of the response by any module.
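
The arithmetic behind the ADDRESS_SPACE example can be checked with a few lines. This is a sketch assuming the 20-bit message address described for the RIF; the helper names `address_space_for` and `match_bits` are hypothetical, and rounding a non-power-of-two map up to the next power of two is an assumption made here for illustration.

```python
ADDR_BITS = 20  # message address width assumed from the RIF signal list


def address_space_for(map_bytes):
    """Internal address lines needed for a byte-addressed map of map_bytes.

    Assumption: sizes that are not a power of two are rounded up.
    """
    return (map_bytes - 1).bit_length()


def match_bits(address_space):
    """Number of most significant address bits compared to recognize a member."""
    return ADDR_BITS - address_space
```

For the 256-byte example, `address_space_for(256)` gives 8 and `match_bits(8)` gives 12, which agrees with the figures in the paragraph.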

[0286] Reset on the Ring

[0287] Each ring-based SOC typically has only one Anchor. The hardware reset starts at this Anchor. The Anchor has a hw_reset input pin. From this pin, reset is sent in two directions. One direction is down the ring. The other direction is to the module that hosts the Anchor, for example, a packet processor. The reset propagates through the ring in the logical ring order. It is the same path all supervisor messages take, although the reset is a signal rather than a message. However, it is unconditionally flip-flopped at each ring member. It is also possible to force a soft reset on ring members using Activate messages.

[0288] In accordance with one embodiment of the present invention, a rings-based system is provided. The rings-based system comprises a plurality of ring members on a ring network that communicate using point-to-point connectivity, a message traversing the ring from member to member, wherein the system is adapted so that upon the message arriving at a given ring member the message is processed by that ring member if the message is applicable to that ring member, and if the message is not applicable to that ring member, the message is passed on to the next ring member, and where the message causes a reset, such as a soft reset, of the given ring member if the message is applicable to that ring member. The message preferably includes address information corresponding to the given ring member. The message can include an activate message that includes at least one bit for causing a reset.

[0289] The message, in one embodiment, causes a reset by writing at least one bit from the message into a ring interface for the given member. In this case, the ring interface can include a bit that is reset by the message, where the bit preferably includes an activated bit or a reset bit. The ring interface can be adapted to provide an output to the given ring member for causing the reset, wherein the output preferably includes a control pin coupled to the given ring member.

[0290] In accordance with another embodiment of the present invention, a rings-based system is provided. The rings-based system comprises a plurality of ring members on a ring network that communicate using point-to-point connectivity, a message traversing the ring from member to member, wherein the system is adapted so that upon the message arriving at a given ring member the message is processed by that ring member if the message is applicable to that ring member, and if the message is not applicable to that ring member, the message is passed on to the next ring member; and wherein the system further comprises a reset control signal that causes multiple members of the ring network to be reset (such as a hard reset).

[0291] The reset control signal can include a hardware signal that is sent independent of the message. Furthermore, the reset control signal can be sent on a different line from the message. The reset control signal can be adapted to cause all ring members except for the member from which the reset signal originates to be reset. The reset control signal, in one embodiment, causes a reset by causing the reset of bits in ring interfaces corresponding to the multiple members. In this case, the ring interfaces can provide an output to their corresponding ring members to cause the resets, where the outputs can include control pins coupled to the corresponding ring members.

[0292] In accordance with an additional embodiment of the present invention, a rings-based system is provided. The rings-based system comprises a plurality of ring members on a ring network that communicate using point-to-point connectivity, a message traversing the ring from member to member, the system being adapted so that upon the message arriving at a given ring member the message is processed by that ring member if the message is applicable to that ring member, and if the message is not applicable to that ring member, the message is passed on to the next ring member, wherein the system includes a message that can cause a reset of the given ring member if the message is applicable to that ring member, and wherein the system further includes a reset control signal that causes multiple members of the ring network to be reset. The message that can cause a reset can cause a soft reset of the given ring member, wherein the reset control signal causes hard resets of the multiple members.

[0293] Message Types and Formats

[0294] Messages come in roughly four categories:

[0295] Supervisor requests—include reset, Enumerate, Who_Am_I requests, activate, freeze. These messages are generated by the Anchor and are flooded through the network.

[0296] Supervisor response—include Exception, WhoAmI_response. These supervisor messages are generated by regular members and float to the Anchor for its attention.

[0297] Programming—include regular work write and read messages.

[0298] Work—includes work_read and work_write.

[0299] The Enumerate message: The Enumerate (or Enum) message is initiated by the active Anchor. In each ring system there is only one active Anchor. An Anchor decides it is active if it is told to start the Enumeration through its omessage inputs. The message can include a header field, a data field, a next available address field, a ring ID, and the like. The ring ID bit is flipped every time the message crosses a bridge. It is recorded in the activate register in every ring interface. This bit can later be used by software to determine the exact ring topology.

[0300] Who_am_I message: To learn the topology, the Anchor starts a WhoAmI_request message. Each member that receives this message first responds to it, then relays the request message. This order assures that the Anchor will see the request message only after all responses. Thus it can determine that the WhoAmI process has ended. In the request message, the field typically used is the type field. In a response, the address part of the message is the module's Self_Address and the data field holds information about the module.
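
The respond-then-relay ordering can be demonstrated with a small simulation. This is a sketch under simplifying assumptions: a single ring with no bridges, members that preserve message order, and a hypothetical tuple format `(kind, self_address, module_id)` for messages. The function name `who_am_i` is also hypothetical.

```python
def who_am_i(members):
    """Simulate one WhoAmI pass around a single ring.

    members: list of (self_address, module_id) in ring order after the Anchor.
    Returns the messages in the order in which the Anchor receives them.
    """
    stream = [("request", None, None)]  # the Anchor starts the request
    for self_address, module_id in members:
        outgoing = []
        for message in stream:
            if message[0] == "request":
                # Respond first, then relay the request.
                outgoing.append(("response", self_address, module_id))
                outgoing.append(message)
            else:
                outgoing.append(message)  # responses from earlier members pass through
        stream = outgoing
    return stream
```

Because every member places its response ahead of the relayed request, the request is always the last message to arrive back at the Anchor, which is how the Anchor can tell that the process has ended.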

[0301] Activate message: The Activate message is issued through the Anchor. It carries the address of a specific member and a few bits in the data field used to write the activation register. The bits in the activation register control the state and behavior of the members.

[0302] Freeze message: The freeze message unclogs rings and deactivates all members.

[0303] Tools for Module and Ring Network Builder

[0304] Write Ahead Mode—Read operations in a rings-based architecture typically are much more time consuming than write operations. Status registers are usually inspected by CPUs before sending or receiving data, so it generally is desirable to get status fast. The delay of a two-way trip from CPU to peripheral and back often is unacceptable. Accordingly, in another inventive aspect of at least one embodiment of the present invention, the peripheral, every time its status changes, sends it ahead to one or more pre-arranged locations in a CPU's RAM or other device. The extension of this idea is to change every critical read to a send-ahead write. In essence, every time an important parameter changes in some peripheral, its value is written to an agreed memory location in the asker's space. For example, suppose the CPU needs to know how many free entries there are in a Utopia fifo. Instead of the CPU initiating a read operation, the fifo, each time this number significantly changes, will write it to some agreed location in the CPU's RAM. The CPU now only needs to read its local memory.
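
The write-ahead idea for the fifo example can be sketched in software as follows. This is an illustrative model only: the class name `WriteAheadFifoStatus`, the dictionary standing in for the CPU's RAM, and the `threshold` parameter used to decide what counts as a significant change are all assumptions. In the described system the update would travel as a write message on the ring rather than as a direct memory store.

```python
class WriteAheadFifoStatus:
    """Hypothetical peripheral-side status that is written ahead to CPU RAM."""

    def __init__(self, cpu_ram, status_addr, threshold, free_entries):
        self.cpu_ram = cpu_ram          # stand-in for the CPU's local RAM
        self.status_addr = status_addr  # agreed location for the status value
        self.threshold = threshold      # assumed measure of a significant change
        self.free_entries = free_entries
        self.last_sent = None
        self._write_ahead()             # publish the initial status

    def _write_ahead(self):
        self.cpu_ram[self.status_addr] = self.free_entries
        self.last_sent = self.free_entries

    def update(self, free_entries):
        """Record a new free-entry count; write ahead only on a large change."""
        self.free_entries = free_entries
        if abs(free_entries - self.last_sent) >= self.threshold:
            self._write_ahead()
```

The CPU side then reads `cpu_ram[status_addr]` locally and never issues a read across the ring; the cost is that the value it sees may lag the true count by less than the threshold.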

[0305] To implement the above write ahead modality, a rings-based system on a chip is provided in accordance with one embodiment of the present invention. The rings-based system comprises a plurality of ring members on a ring that communicate using point-to-point connectivity, a message traversing the ring from member to member, where the system is adapted so that upon the message arriving at a given ring member the message is processed by that member if the message is applicable to that ring member, and if the message is not applicable to that ring member, the message is passed on to the next ring member. The system also is adapted to process both read messages and write messages. The plurality of ring members includes a CPU and at least one peripheral that exchanges data with the CPU, wherein the peripheral includes at least one status memory that stores data describing the status of the peripheral, and where the system is configured to write ahead status changes that are accessible by the CPU.

[0306] The system also can be adapted to perform write ahead status changes that would otherwise be initiated by the CPU as read operations. Likewise, the write ahead operations can be programmed to occur based on read operations that would otherwise be initiated by the CPU on a regular basis. The system can be adapted to write ahead status changes to a RAM on the CPU or a RAM that is accessible by the CPU. The CPU can comprise a control protocol processor in a communications chip or network processor in a communications chip. The status memory may comprise at least one status register.

[0307] In at least one embodiment, the write ahead operations are performed for some peripheral status changes but not other peripheral status changes. Additionally, the write ahead operation is performed or not performed depending on the nature of the status change. Alternatively, the write ahead operation is performed or not performed based on the magnitude or the quantity of the status change.

[0308] In accordance with another embodiment of the present invention, a write-ahead method in a rings based communication system, such as a communications processor or a network processor, is provided. The method comprises identifying at least one module in a ring network that includes status registers that store status information of regular interest to a processor in the ring network, identifying which status information can be transmitted to the processor as a write ahead operation initiated by the at least one module instead of a read operation initiated by the processor, and programming the at least one module to transmit the identified status information as a write ahead operation. In one embodiment, the step of programming causes the average number of read operations initiated by the processor to decrease.

[0309] In one embodiment, the identification comprises identifying which status changes are of critical importance or of regular interest to the processor. Alternatively, the identification can include identifying what magnitude or level of status change will cause the write ahead operation.

[0310] Land Bridges—Most members on a ring typically communicate in an asymmetric way. For example, EnetRx (Ethernet receiving) traffic is mostly from a peripheral to a packet processor. For EnetTx (Ethernet transmitting) it is the other way around. A pair of members is asymmetric if one is mainly the sender and the other is mainly the receiver in their relationship. In this case it makes sense to put the sender upstream from the receiver. But some pairs are almost symmetric. A packet processor paired with a DMA is such an example. As such, no matter how they are placed on a ring, one direction is bound to suffer. In this case, one or more land bridges generally will provide the solution.

[0311] As discussed previously with reference to FIG. 14, a single land bridge can be added to minimize latency between two members of a ring. As illustrated in FIG. 34, two or more bridges 332, 334 may be added to a ring 336 to further minimize the number of modules between any two ring members. Although each bridge 332, 334 adds two interfaces (members) to the ring network, this generally will not affect the latency significantly since a message is unlikely to travel the entire perimeter of the ring network due to the bridges.

[0312] Implementation of an External Ring Interface

[0313] Referring now to FIG. 35, an exemplary external ring interface 340 is illustrated in accordance with one embodiment of the present invention. Ring connections between two members can include more than 100 signals. Each message can include, for example, at least 104 signals. Therefore, it may be unreasonable to add this number of pins (twice) to implement the external ring interface. As such, it may be preferable to implement a dual purpose peripheral interface 340, such as Utopia. The normal mode of operation for a Utopia interface is sending/receiving ATM cells. In a similar manner, two ring networks, such as two network processors, can be connected with Utopia interfaces back to back. In this mode, instead of cells, the Utopia pins will convey messages. This will slow the specific ring speed, but not the chip speed, since if the Utopia interface is behind a bridge, only messages to the other side are slowed down, not the internal messages. Using the Utopia infrastructure for this also makes it possible to connect an external FPGA 344 (Field-Programmable Gate Array) as a new peripheral.

[0314] The following is a non-exhaustive list of some of the identified advantages associated with the rings topology of the present invention: high speed circuit design—all connections are point to point unidirectional connections; scalability—once the address routing is resolved the actual topology can be changed relatively easily; the switch fabric is transparent to software, only delays are affected by the topology; typically easier to implement than crossbar or switch design; debug and test visibility—each member can be examined and operated alone; possibility of late processing load balancing—different peripherals can be assigned to different CPUs; and the possibility of no need for precise across-the-chip clock alignment—clock can be adapted to run along messages.

[0315] Although any of a variety of CPUs may be implemented as a module of the ring network topology described herein, ring networks are particularly well-suited for packet processors, various embodiments of which are described in detail below. The packet processor of the present invention may on occasion be referred to herein as the Vobla, the network processor, and similar variations. According to one embodiment, the network processor of the present invention may be implemented as part of a communications processor having multiple modules that are interconnected using the rings architecture described above. The modules in such an arrangement for a communications processor may include the network processor of the present invention (for data plane processing of packets), a control packet processor (for control plane processing as a flow manager), various peripheral modules, and so forth.

[0316] In accordance with one embodiment of the present invention, a rings-based system is provided. The rings-based system comprises a plurality of ring members on a ring network that communicate using point-to-point connectivity, a message traversing the ring from member to member, where the system is adapted so that upon the message arriving at a given ring member the message is processed by that ring member if the message is applicable to that ring member, and if the message is not applicable to that ring member, the message is passed on to the next ring member; and the system further comprising means for providing an external ring interface that enables communication with at least one external peripheral device. The means can comprise a field programmable gate array and/or a memory port ring member on the ring network. The at least one external peripheral device can include one or more of a DSP, encryption engine, external bus, external memory, a second ring network, and the like.

[0317] In one embodiment, the means is adapted to perform handshaking between the protocols of the ring network and the at least one external peripheral device, wherein the handshaking preferably includes converting message data from the ring network into transaction data. The means also can be adapted to allow the ring network to write out messages to the at least one external peripheral and the at least one external peripheral to generate transactions converted into messages for the ring network.

[0318] The means, in one embodiment, operates as a shared memory between the ring network and the at least one external peripheral. In this case, the means may include a memory that operates as a RAM for messages received from the ring network and as a FIFO for transactions received from the at least one external peripheral device. The means also may include a memory, wherein the ring network can write data to an address in the memory to cause an interrupt in the at least one external peripheral device.

[0319] In one embodiment, the ring network is a first ring network on a first chip, where the rings-based system further comprises a second ring network on a second chip, and wherein the first ring network and the second ring network interface through the means to the at least one external peripheral device.

[0320] Alternatively, the ring network can include a first communications processor including a first protocol processor and a first network processor, and the system can further comprise a second communications processor including a second protocol processor and a second network processor, wherein the first communications processor and the second communications processor interface through the means to the at least one external peripheral device.

[0321] In accordance with yet another embodiment of the present invention, a network processor implemented on a chip is provided. The network processor comprises means for processing a plurality of protocols including ATM, frame relay, Ethernet, and IP, said means being programmable using a set of library commands to process additional protocols, and wherein said means comprises an arithmetic logic unit (ALU), a load/store unit (LSU), a preload/bump unit (PBU), a register file unit (RFU), an agent interface, and an internal memory. The network processor, in one embodiment, further comprises a fetch unit and a program sequencer.

[0322] The ALU can be adapted to perform arithmetic and logic operations on data operands. The LSU can be adapted to perform address calculations in order to address data operands in the internal memory. The LSU calculates an effective address according to one of five available options, including: (1) effective address is the content of a register from the RFU; (2) effective address is the sum of the content of a first register from the RFU and the content of a second register from the RFU; (3) effective address is the sum of the content of a first register from the RFU and the content of a second register from the RFU after the second register is shifted by a specified number of bits; (4) effective address is the sum of the content of a register from the RFU and a displacement that occupies a specified number of bits in an instruction word; and (5) effective address is an absolute address included in the instruction word. The PSU, in one embodiment, performs decoding of instructions received from the internal memory. The fetch unit can be adapted to control what instructions are fetched from memory for decoding by the PSU. The internal memory can be adapted for storing program information and data.
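
The five effective-address options can be written out as a short sketch. The function name `effective_address`, its parameter names, and the numeric mode codes are hypothetical. The direction of the shift in option (3) is not stated in the text; a left shift is assumed here. Address width and wraparound are not modeled.

```python
def effective_address(mode, regs, ra=0, rb=0, shift=0, disp=0, absolute=0):
    """Compute an LSU effective address for one of the five described options.

    regs: list of register values standing in for the RFU.
    ra, rb: register indices; shift, disp, absolute: instruction word fields.
    """
    if mode == 1:
        return regs[ra]                           # register content
    if mode == 2:
        return regs[ra] + regs[rb]                # register + register
    if mode == 3:
        return regs[ra] + (regs[rb] << shift)     # assumed left shift of second register
    if mode == 4:
        return regs[ra] + disp                    # register + displacement
    if mode == 5:
        return absolute                           # absolute address from instruction word
    raise ValueError("unknown addressing mode")
```

Option (3) is the natural fit for indexing into a table of fixed-size entries: with a base address in the first register and an entry index in the second, a shift of 2 addresses 4-byte entries.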

[0323] The RFU, in one embodiment, comprises a first register file for a current task and a second register file for preloading register values for a next task. In this case, data may be read from or written to the first register file based on a comparison between a current task ID and a task ID associated with the first register file. The RFU also can comprise a third register file for storing register values for the current task that are not stored in the first register file. In this case, data may be read from or written to the third register file when the current task ID and the task ID associated with the first register file are not the same. In one embodiment, a task switch is performed by the network processor by making the next task the current task and preloading a further next task. The performance of a task switch can include treating the second register file as the third register file after the task switch.

[0324] The agent interface, in one embodiment, allows the network processor to interface to external modules for executing instructions, where the external modules can include one or more of a CRC module, encryption module, hashing module, and table lookup module.

[0325] In yet another embodiment of the present invention, a communications processor implemented on a chip is provided. The communications processor comprises a network processor including means for processing a plurality of protocols including ATM, frame relay, Ethernet, and IP, said means being programmable using a set of library commands to process additional protocols, wherein said means comprises an arithmetic logic unit (ALU), a load/store unit (LSU), a preload/bump unit (PBU), a register file unit (RFU), an agent interface, and an internal memory. The communications processor further comprises a protocol processor for controlling the network processor, wherein the protocol processor performs control plane processing and the network processor performs data plane processing. The network processor can be adapted to process instructions by performing a fetch, decode, address, execute, and a write.

[0326] In one embodiment, the network processor and the protocol processor are ring members on a ring network, and the communications processor further comprises a plurality of other ring members on the ring network. In this case, the network processor includes a plurality of compounds that share a single ring interface to the ring network, wherein the compounds can include, for example, a doorbell agent for controlling the execution sequence of tasks for the network processor. The compounds also may include a multireader agent for servicing requests to read data from the internal memory, a message sender agent for sending messages onto the ring network, a DMA agent for sending messages to initiate a DMA controller on the ring network, a CRC agent for performing CRC calculations, and/or a debug module. Generally, a packet processor includes the following capabilities that are typically not found in general purpose microprocessors:

[0327] Zero overhead task switching: Usually, each interface (I/f) port would require at least 2 tasks (RX [receive] and TX [transmit]) to handle the datapath processing. A system that includes several ports would require about two or more active tasks for each port. As such, the packet processor should be able to switch tasks with minimum overhead. The packet processor may allocate shadow memory (4-8 tasks) to store registers and task status. The priority scheme to choose the next_task_to_run is hardware (HW) based and is not performed by software (SW) as in a RISC (Reduced Instruction Set Computer) model.

[0328] Parallel engines: Processing of packets can use parallel machines to accelerate performance. Examples for this capability include DMA, CRC, Lookup engine, and Peripheral Transfer Machine. A well-built packet processor would have the mechanism in place to issue transactions to, and receive transactions from, parallel machines synchronously without stalling the packet processor.

[0329] Data movements: Packet processing requires data movements from First-In-First-Out (FIFO) memory to internal memory, and from internal memory to external memory and vice versa. This is performed using parallel Direct Memory Access (DMA) machines. Data transfers should be optimized and deterministic within boundaries. Hence the right mechanisms have to be in place between the DMAs and the packet processor to allow the transactions between the engines and to ensure deterministic behavior.

[0330] Scalability: One way to scale the throughput of a packet processor is by instantiating several engines. Hence, it is desirable that the programming model and the system architecture be flexible enough to accommodate scalability.

[0331] Special instructions: Packet processing uses special operations that are not common for a general purpose processor. Instructions such as Compare immediate under mask (to match specific bits), instructions that activate parallel engines (such as CRC, DMA, HASH, and LIST SEARCH), and mechanisms such as Sticky bits for compare and jump are derived from the needs of packet processing.
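
As a hedged illustration of the compare-immediate-under-mask operation named above, the following Python sketch models the comparison in software. The function name, the argument order, and the 32-bit operand width are assumptions made for illustration; they are not taken from the instruction set of the invention.

```python
WORD_MASK = 0xFFFFFFFF  # assumption: 32-bit operands


def compare_immediate_under_mask(value: int, immediate: int, mask: int) -> bool:
    """Return True when value and immediate agree on every bit selected by mask."""
    # XOR leaves a 1 wherever the two operands differ; the mask keeps only
    # the bit positions the caller cares about.
    return ((value ^ immediate) & mask & WORD_MASK) == 0


# Example: test only the upper nibble of a byte, ignoring the lower nibble.
header_byte = 0x45
upper_nibble_is_4 = compare_immediate_under_mask(header_byte, 0x40, 0xF0)
```

In this sketch the lower nibble of `header_byte` does not influence the result, which is the point of comparing under a mask.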

[0332] Inter-task communication: Inter-task communication is supported by the architecture. Traditional RISC machines generally use SW for this communication.

[0333] Efficient linked list operation: Data structures like linked lists, queues and buffers are common in communication systems. A flexible packet processor should be able to manage a large number of different queue types in an efficient and quick way.

[0334] Exemplary Processing Requirements

[0335] According to one aspect of the invention, the flexible packet processor should support processing of the following: ATM, Frame Relay (FR), IP/Ethernet, IWF (TDM to Packets), AAL2 for wireless base stations, IP, and MPLS.

[0336] ATM is by far the largest access method in the access space. A packet processor in the space should be able to terminate ATM virtual circuits (VCs) Customer Premises Equipment (CPE) and should be able to switch ATM. ATM is of particular interest because a vast majority of the DSL approaches use ATM as the carrier technology. Frame Relay is of interest because it is commonly used in corporate access (e.g., using T1s or NxT1).

[0337] After dominating the LAN space, Ethernet is becoming a cost-effective technology for the Metropolitan Area Network (MAN). This simplifies the need for a costly router (no ATM) at the corporate edge. This is a new approach that ISPs (CLECs [Competitive Local Exchange Carriers]) use as a way to replace the old Telco access (leased lines). However, Ethernet access does not solve the issue of how to deal with corporate voice. Typical requirements for IP/Ethernet would be IP routing and Ethernet bridging at 100 Mbps and approaching 1G-Enet.

[0338] Packet processing for inter-working functions (IWF) (e.g., TDM to packets) is typically found in Voice Gateways (VG) and in Wireless Base Stations (WBS). The VG interfaces with the POTS (plain old telephone system) network on one side and the packet network on the other side. Voice calls are modified (compressed and packetized, or uncompressed and circuitized) between the networks. Hence typical processing requirements at the VG include: termination of AAL2 streams; support for CES (Circuit Emulation Services) (AAL1) to emulate T1 services; termination of RTP (Real-time Transport Protocol) (VoIP) packets; and the like. AAL2 processing may find useful application for Wireless Base Stations. New generation WBSs use ATM as their backbone network. To optimize bandwidth, AAL2 may be chosen to carry both voice and data. In that case, the following processing requirements result: AAL2 Termination at the BTS (Base Transceiver Station); AAL2 Switching at the BTS and at the MSC (Mobile Switching Center)/BSC (Base Station Controller); AAL2 Termination is done at the MSC/BSC (OC-3 and IP is routed to ISP); and IMA (Inverse Multiplexing over ATM) is being used as the connection between BTSs and the MSC both for redundancy and for cost.

[0339] The flexible packet processor should handle IP because IP processing can be found in various applications in the access space, such as the following: ISP aggregation router; DSLAM for handling frames; cable modem head end; and wireless base station. MPLS (Multiprotocol Label Switching) is a newcomer to the access space. It is being used for traffic management and for Quality of Service (QoS) control. It is desirable that access equipment support LSR (Label Switched Router) (edge device) for MPLS.

[0340] As demonstrated above, the access market requires different access methods. The access market has a need for IWF between these different methods, which, in turn, drives the requirement for unique processing capabilities. Also, the different market segments have many similarities regarding their processing requirements. Thus, a flexible packet processor according to the invention can form the basis of an access platform that is capable of addressing multiple applications in this space.

[0341] Architectural Overview of a Flexible Packet Processor

[0342] The flexible packet processor in accordance with various embodiments of the present invention is a general-purpose network processor core, allowing it to support many system-on-chip (SOC) configurations. A library of modules containing memories, peripherals, accelerators, and other processor cores makes it possible for a variety of highly integrated and cost-effective SOC communication devices to be built around the packet processor. Figure shows a block diagram of an exemplary SOC chip 350 made up of the network processor core 354 and associated SOC components (described below) according to an embodiment of the invention. Although not indicated in this configuration, a typical SOC can contain more than one network processor core 354.

[0343] Internal Memory Expansion Area (Internal Memory 352): On-chip memories operating at full core frequency are connected to the network processor core 354 through this component. The internal memory is unified and can be used for both program and data storage. Different technologies such as SRAM or ROM can be used to implement the internal memory.

[0344] Network Processor Core 354: The network processor core is the processor in which the network data path application code is executed, and which may include: a program sequencer unit (PSU); a load store unit (LSU); a fetch unit (FTU); a data arithmetic logic unit (DALU); a register file (RFU) including support of fast task switching; a preload and bump unit (PBU) for efficient task switching and context save and restore; and the like. These components are discussed below in greater detail.

[0345] A companion (sometimes called a compound) that is tightly coupled to the network processor core is the doorbell scoreboard module (doorbell) shown in FIG. 36. The doorbell receives requests for service from peripherals, accelerators and DMAs, and then determines a next task ID once a task switch occurs in the network processor.

[0346] Peripheral Expansion Area 356, Accelerators 358 and System Expansion Area 360: These components shown in FIG. 36 include the functional units that interface between the network processor core and the application, including the functions that send and receive data from external input/output sources. In addition, these components include accelerators 358 that execute portions of the application in order to boost performance and decrease power consumption. These components are application-specific and may or may not include various functional units such as: a host interface; an external memory interface (e.g., SDRAM controller); a serial interface (USB, UART, SSI [Synchronous Serial Interface], Timers); a communications interface (Utopia, MII); a CRC accelerator; a table lookup coprocessor; Smart FIFO; a data pump; a direct memory access (DMA) controller; as well as other CPU cores, such as packet processors (PPs).

[0347] To provide the data exchange between the core and the other on-chip blocks or modules, the following ports may be implemented: data memory ports (address, data read and data write) used for data transfers between the core and memory; a program memory port (address and data read) for fetching code from the memory to the core; an agent port to support tightly-coupled external user-definable functional units such as peripherals, accelerators, DMAs, smart FIFOs, and so forth; and a context memory port (address, data read and data write) used for the preload and bump of registers for fast task switching.

[0348] Referring now to FIG. 37, the network processor core 354 is illustrated in greater detail in accordance with at least one embodiment of the present invention. As discussed above, the network processor core, in one embodiment, includes the following:

[0349] Data Arithmetic Logic Unit (DALU or ALU) 370: The DALU 370 (also referred to as the ALU below) performs the arithmetic and logical operations on data operands in the network processor core. The data registers can be read from or written to memory over, for example, a 32-bit wide data bus as 8-bit, 16-bit, or 32-bit operands. The source operands for the ALU 370 are 32 bits wide and originate either from data registers or from immediate data (Imm). The results of ALU operations are stored in the data registers.

[0350] According to one aspect of the invention, ALU operations are performed in one clock cycle. The destination of each arithmetic operation can be used as a source operand for the operation immediately following the arithmetic operation without any time penalty. In one embodiment, the components of the ALU 370 are as follows: an integer arithmetic unit for 32-bit non-saturated three-operand arithmetic operations; a logic unit for 32-bit logic operations; a bit field unit (BFU) for multi-bit shift, rotate, swap and bit-field insert and extract operations; and a condition code generation unit.

[0351] The ALU 370 may read two operands from the register file via the dual source bus (src1 and src2 in FIG. 37), or one operand from a register via the source bus and a second immediate operand via the immediate bus (Imm input to DALU in FIG. 37). The ALU 370 generates a result into a destination register via the destination bus (dest in FIG. 37).

[0352] The condition codes are optionally generated in the condition code register (part of the R1 register, discussed further below) depending on the instruction type.

[0353] The ALU 370 may support both signed and unsigned arithmetic. Most of the unsigned arithmetic instructions are performed the same as the signed instructions. However, some operations may require special hardware and may be implemented as separate instructions. When performing an unsigned comparison, for example, the condition code computation is different from signed comparisons. The most significant bit of the unsigned operand has a positive weight, while in signed representation it has a negative weight. Special condition codes and instructions may be implemented to support both signed and unsigned comparisons.
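
The difference in the weight of the most significant bit can be seen in a short Python sketch. It assumes 32-bit two's complement operands; the helper names are hypothetical and do not correspond to instructions or condition codes of the network processor.

```python
MASK32 = 0xFFFFFFFF
SIGN_BIT = 0x80000000


def as_signed32(word: int) -> int:
    """Interpret a 32-bit pattern as two's complement (MSB has negative weight)."""
    word &= MASK32
    return word - (1 << 32) if word & SIGN_BIT else word


def less_than_unsigned(a: int, b: int) -> bool:
    """Compare two 32-bit patterns with the MSB carrying positive weight."""
    return (a & MASK32) < (b & MASK32)


def less_than_signed(a: int, b: int) -> bool:
    """Compare two 32-bit patterns with the MSB carrying negative weight."""
    return as_signed32(a) < as_signed32(b)


# The same bit patterns order differently under the two interpretations.
a, b = 0xFFFFFFFF, 0x00000001
unsigned_result = less_than_unsigned(a, b)  # a is the largest unsigned value
signed_result = less_than_signed(a, b)      # a is -1 when read as signed
```

Because the two comparisons disagree for the same operands, a single condition code computation cannot serve both cases, which is consistent with the separate condition codes and instructions mentioned above.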

[0354] The Load Store Unit (LSU) 372

[0355] The LSU 372 performs address calculations using integer arithmetic needed to address data operands in memory. In addition, the LSU 372 generates change-of-flow program addresses. The LSU 372 operates in parallel with other network processor core resources to minimize address generation overhead.

[0356] The effective address (EA) used to point to a memory location for a load or a store is calculated according to one of the following options. According to one embodiment, only the 16 least significant bits (LSBs) of the calculation result are considered. The options for calculating the EA include:

[0357] Register indirect, no update (Rn): The EA is the content of a register Rn from the register file.

[0358] Indexed by register Ri (Rn+Ri): The EA is the sum of the contents of the register Rn and the contents of the register Ri.

[0359] Indexed by a shifted register Ri (Rn+(Ri<<m)): The EA is the sum of the contents of the register Rn and the contents of the register Ri after Ri is pre-shifted to the left by m bits.

[0360] Indexed by displacement (Rn+xx): The EA is the sum of the contents of the register Rn and a displacement xx that occupies m bits in the instruction word. The displacement is sign-extended and added to Rn to obtain the operand address.

[0361] Absolute address: The EA is the absolute address expressed in the instruction.
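
The five addressing options above can be sketched in Python as follows. This is a minimal model under stated assumptions: the function names are hypothetical, register contents are passed in as plain integers, and the 16-bit truncation described for one embodiment is applied uniformly to every mode, including the absolute mode.

```python
EA_MASK = 0xFFFF  # only the 16 least significant bits of the result are kept


def sign_extend(value: int, bits: int) -> int:
    """Sign-extend a field that occupies `bits` bits of the instruction word."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def ea_register_indirect(rn: int) -> int:
    return rn & EA_MASK


def ea_indexed(rn: int, ri: int) -> int:
    return (rn + ri) & EA_MASK


def ea_indexed_shifted(rn: int, ri: int, m: int) -> int:
    # Ri is pre-shifted to the left by m bits before the addition.
    return (rn + (ri << m)) & EA_MASK


def ea_displacement(rn: int, xx: int, m: int) -> int:
    # xx is an m-bit field; it is sign-extended before being added to Rn.
    return (rn + sign_extend(xx, m)) & EA_MASK


def ea_absolute(address: int) -> int:
    return address & EA_MASK
```

For example, with these definitions an 8-bit displacement of 0xFF is treated as -1, so `ea_displacement(0x1000, 0xFF, 8)` addresses the location just below 0x1000, and a sum that overflows 16 bits wraps around because of the mask.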

[0362] The Network Processor Registers

[0363] The network processor registers are classified into three types: General Purpose Registers (GPR); Special Purpose Registers (SPR); and Hidden registers (HR). The general purpose registers may be used by the programmer to load data from memory, execute arithmetic or logic operations, and store the data back into memory. The special purpose registers are registers that have an associated functionality, such as a task SPR, and so forth. Generally, SPRs may not be loaded or stored directly from/to memory. According to one approach, a dedicated move instruction can move data between general purpose registers and special purpose registers. Hidden registers are registers which are not exposed to the programmer, but reside in the hardware as part of the machine control (e.g., a current PC [Program Counter] register).

[0364] The General Purpose Register File 374

[0365] The network processor of the present invention includes a special register file architecture and a memory block that are capable of managing a large number of tasks (threads) with substantially no cycle penalty. The memory block has the capacity to store the register context of the tasks. The register file architecture performs a reduced number of context save and restore operations and provides each active task with its own context registers.

[0366] The benefits of this approach, discussed in detail below, include at least some of the following: support of nearly unlimited tasks; no cycle overhead for context save and restore operations upon task switches; transparency to the programmer; and cost-effectiveness and low circuit overhead.

[0367] One conventional approach to the multi-task switching issue provides that every task switch is accompanied by a context save and restore cycle, usually performed by software. This approach takes extra cycles. Another conventional approach uses special circuitry that allows access to the memory using wide busses, thus enabling multiple registers to be saved or restored at a time. This approach reduces the number of cycles, but complicates the interface to the memory (the Tricore CPU from Siemens uses this approach). Another approach uses multiple register files, one for each task. This approach has the disadvantage of limiting the number of tasks to the number of register files, and this is also a costly and limiting solution. The large number of register files can also impact the frequency of operation due to fan-out limitations. (Products using this approach include, for example, the Intel IXP12000 and Lexra NetVortex LX8000 Network Processor.) According to one approach taken by the instant invention, the programming model of the network processor core has 32 general purpose registers. These registers can be read from or written to over the memory data buses (e.g., referring to FIG. 37, the src1, src2, and dest buses). Source operands for ALU instructions originate from these registers. According to one beneficial aspect of the invention, the destination of an ALU instruction is a register, and such a destination can also be used as a source operand for the ALU instruction immediately following, without any time penalty.

[0368] At the heart of the network processor core 354 is a set of three register files and dedicated hardware that implements a mechanism for automatically saving and restoring the registers such that a task switch is accomplished with minimal overhead on the main flow. Upon entering a task, both the current and next task identification (task ID) are sampled. These three register files are as follows: the active register file, used to run the current task; the Shadow1 register file, which contains the valid register values of the current task that do not exist in the active register file; and the Shadow2 register file, used to preload register values of the next task concurrently with the current task run. The active register file has 32 general purpose registers. These registers are part of the programming model and are exposed to the programmer. According to one approach, each register of the active register file has a 32-bit data field and a 6-bit tag field. The tag field holds the task ID, which identifies the task for which the data register value is valid.

[0369] The network processor core 354 includes a boundary register which specifies for each of the registers whether it is considered a global register or a general register. The global registers may store global values that can be shared among multiple tasks, or they may store temporary values that are not preserved when the task yields and resumes processing.

[0370] The Shadow register files (Shadow1 and Shadow2) are not part of the programming model, i.e., they are not exposed to the programmer. Each of the Shadow1 and Shadow2 register files includes, for example, 32 registers of 32 bits.

[0371] According to one approach, task switches do not require an explicit save/restore of the general registers. Saves and restores of the general registers are done implicitly by hardware according to the following mechanisms. In case of a write to a general register, the task ID associated with the register of the active register file is first compared to the current task ID. If the two are equal, the register is maintained by the current task, and, therefore, the register is overwritten with the new value and the current task ID is marked in its tag field. A non-equal result means that the register contains valid data for a different task. In this case, the old register content is first sent to a write queue buffer to be saved in memory in a task ID context table, and then the new value is written to the register and the current task ID is marked in its tag field.

[0372] In case of a read from a general register, the task ID associated with the register is first compared to the current task ID. An equal result means that the register contains valid data for the currently running task, and thus the data is read directly from the register. A non-equal result means that the register contains valid data for a different task. However, the valid data for the current task for that register resides in the Shadow1 register file, because it was preloaded into Shadow2 concurrently with the execution of the previous task and Shadow2 became Shadow1 at the task switch. As a result, the register value is read from the Shadow1 register file, and the register of the active register file remains unchanged.
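
The write and read rules for general registers can be modeled with a short Python sketch. It is an illustration under stated assumptions, not the hardware design: the class and attribute names are hypothetical, a tag of `None` stands for a register that no task has written yet, the write queue is a plain list, and the Shadow1 contents are assumed to have been filled by an earlier preload.

```python
class TaggedRegisterFile:
    """Software model of the active register file with per-register task tags."""

    def __init__(self, num_regs=32):
        self.data = [0] * num_regs       # 32-bit data fields
        self.tag = [None] * num_regs     # task ID for which each value is valid
        self.shadow1 = [0] * num_regs    # current task values not in the active file
        self.write_queue = []            # pending saves: (task_id, reg, value)
        self.current_task = 0

    def write(self, reg, value):
        owner = self.tag[reg]
        if owner is not None and owner != self.current_task:
            # The register holds valid data for another task: queue the old
            # content for saving in that task's context table entry first.
            self.write_queue.append((owner, reg, self.data[reg]))
        self.data[reg] = value & 0xFFFFFFFF
        self.tag[reg] = self.current_task

    def read(self, reg):
        if self.tag[reg] == self.current_task:
            return self.data[reg]
        # Tag mismatch: the current task's value lives in Shadow1, and the
        # active register is left untouched.
        return self.shadow1[reg]
```

A read never modifies the active register file in this model, while a write to a register owned by another task produces exactly one queued save, matching the two cases described above.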

[0373] A read or write access to a global register accesses the active register file directly without changing the register's tag. Concurrent with the execution flow of the current task, a special machine (the PBU 376 of FIG. 37) preloads the register values of the next task ID into the Shadow2 register file.

[0374] Upon a task switch request, the following actions should take place: the preload of the register values of the next task should be completed; the Bump buffer is emptied, i.e., all data which was sent to the bump unit is saved in the context table; the next task becomes the current active task; the Shadow2 register file becomes the shadow for the current task (Shadow1); and a new next task is sampled and a new preload procedure is initiated onto Shadow2. Special care should be taken (and special logic may be implemented) to prevent hazard cases. For example, a mismatch in the register value occurs if a register in the active register file is tagged for a task ID which is identical to the next task ID, and that register is accessed as a destination in the current task. In this case the register value should be first saved in memory in its context location and then overwritten with the new value of the current task. However, since the previous task is identical to the next task, it could be that the register value is already preloaded into the next task shadow register file (Shadow2). In this case, the value preloaded into Shadow2 is no longer valid.
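
The sequence of actions at a task switch can be sketched as the following Python model. All names are hypothetical, and the context table is represented as a dictionary from task ID to a list of register values. The source states only that special logic may be implemented for the hazard case; forwarding a bumped value into Shadow2 when it belongs to the next task is one possible resolution assumed here for illustration.

```python
class TaskSwitchModel:
    """Illustrative model of preload, bump, and shadow rotation at a task switch."""

    def __init__(self, context_table, current_task, next_task, num_regs=32):
        self.context_table = context_table        # {task_id: [register values]}
        self.num_regs = num_regs
        self.current_task = current_task
        self.next_task = next_task
        self.shadow1 = list(context_table[current_task])
        self.shadow2 = [None] * num_regs          # filled gradually by the preload
        self.preloaded = 0                        # number of registers preloaded
        self.bump_buffer = []                     # pending saves: (task_id, reg, value)

    def preload_step(self):
        """Preload one register of the next task into Shadow2."""
        if self.preloaded < self.num_regs:
            reg = self.preloaded
            self.shadow2[reg] = self.context_table[self.next_task][reg]
            self.preloaded += 1

    def task_switch(self, new_next_task):
        # 1. Complete the preload of the next task's register values.
        while self.preloaded < self.num_regs:
            self.preload_step()
        # 2. Empty the bump buffer into the context table.
        for task_id, reg, value in self.bump_buffer:
            self.context_table[task_id][reg] = value
            if task_id == self.next_task:
                # Assumed hazard handling: the value preloaded earlier is
                # stale, so replace it with the value that was just saved.
                self.shadow2[reg] = value
        self.bump_buffer.clear()
        # 3. The next task becomes current; Shadow2 becomes Shadow1.
        self.current_task = self.next_task
        self.shadow1 = self.shadow2
        # 4. Sample a new next task and restart the preload onto Shadow2.
        self.next_task = new_next_task
        self.shadow2 = [None] * self.num_regs
        self.preloaded = 0
```

In this model a save that is still pending for the next task overrides the stale preloaded copy, so the value seen after the switch matches the one stored in the context table.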

[0375] FIG. 38 illustrates the register files structure and a mechanism for low overhead task switch according to an embodiment of the invention in accordance with the discussion above. In the top half 390 of FIG. 38, the current task ID is Task_X, the next task ID is Task_Y. In the bottom half 392 of FIG. 38, after a task switch the current task ID becomes Task_Y and the next task ID becomes Task_Z.

[0376] In accordance with one embodiment of the present invention, a method for efficient processing of tasks in a communications system is provided. The method comprises sampling a current task identifier and a next task identifier, providing a first register file for storing values for a current task, and providing a second register file for storing values for the current task that are not in the first register file. The method further comprises providing a third register file for preloading values for the next task, and performing a task switch by making the next task identifier the current task identifier and sampling a further next task identifier. The method can further comprise the step of completing the preload of the register values for the next task identifier which after the task switch is the current task identifier. In this case, the method may also comprise using the third register file as the second register file after the task switch.

[0377] The first register file, in one embodiment, comprises registers with a data field and a task identifier field. In this case, the first register file has 32 registers, each register having a 32 bit data field and a 6 bit task identifier field. The first register file may be exposed to a programmer of the communications processor and the second register file and the third register file are hidden from the programmer. In one embodiment, task switches are performed without an explicit save/restore of the register files.

[0378] The method can further comprise performing a write during execution of the current task by: comparing the current task identifier to a task identifier in the first register file; writing a value to the first register file when the current task identifier is the same as the task identifier in the first register file; and writing a value to the first register file when the current task identifier is not the same as the task identifier in the first register file after the content in the first register file is saved to a memory. The content in the first register file can be saved to a task identifier context table.

[0379] The method may also comprise performing a read during execution of the current task by: comparing the current task identifier to a task identifier in the first register file; reading a value from the first register file when the current task identifier is the same as the task identifier in the first register file; and reading a value from the second register file when the current task identifier is not the same as the task identifier in the first register file. In this case, the content of the first register file may not be changed as a result of the read.

[0380] In an additional embodiment of the present invention, a system for efficient processing of tasks in a communications system is provided. The system comprises means for sampling a current task identifier and a next task identifier, a first register file for storing values for a current task, a second register file for storing values for the current task that are not in the first register file, a third register file for preloading values for the next task, and means for performing a task switch by making the next task identifier the current task identifier and sampling a further next task identifier.

[0381] In one embodiment, the means for performing a task switch completes the preload of the register values for the next task identifier which after the task switch is the current task identifier. Similarly, the means for performing a task switch uses the third register file as the second register file after the task switch.

[0382] The first register file comprises registers with a data field and a task identifier field, wherein the first register file can have 32 registers, each register having a 32 bit data field and a 6 bit task identifier field, and further wherein the second register file and the third register file each have 32 registers.

[0383] The system may further comprise a processor which performs a write during execution of the current task by: comparing the current task identifier to a task identifier in the first register file; writing a value to the first register file when the current task identifier is the same as the task identifier in the first register file; and writing a value to the first register file when the current task identifier is not the same as the task identifier in the first register file after the content in the first register file is saved to a memory. The content in the first register file can be saved to a task identifier context table. The processor may comprise an ALU.

[0384] The system may also comprise a processor which performs a read during execution of the current task by: comparing the current task identifier to a task identifier in the first register file; reading a value from the first register file when the current task identifier is the same as the task identifier in the first register file; and reading a value from the second register file when the current task identifier is not the same as the task identifier in the first register file. In this case, the content of the first register file is not changed as a result of the read. In one embodiment, the means for performing a task switch comprises a preload and bump unit. The processor may comprise an ALU.

[0385] The Preload and Bump Unit (PBU) 376

[0386] Referring back to FIG. 37, the PBU 376 controls the access of data memory for the automatic save and restore of registers in their context table in memory. A save of a register content in its location in the context table is performed whenever the register in the active register file is addressed as a destination and the register contains valid data for a task different from the currently running task. Generally, only one request for a save can be captured in the PBU 376 for a single instruction because only one destination can appear in an instruction.

[0387] The PBU 376 includes a write queue with a number of entries in order to minimize the interference with the main program flow, thus optimizing the total execution time. Whenever a register addressed as a source does not contain valid data for the currently running task, the data is read from the Shadow1 register file where it was previously preloaded.

[0388] The PBU 376 is also responsible for controlling the preload of the next task registers into the Shadow2 register file. The PBU 376 generates the data memory accesses for save (write) and preload (read) using the context address and data busses. According to one embodiment of the invention, the load store cycles of the active flow have the highest priority, followed by the preload cycles, with the save cycles from the write buffer at the lowest priority.
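
The fixed priority order among the three kinds of memory cycles can be expressed as a small arbiter. The Python sketch below is a hedged illustration only: the function name, the boolean request inputs, and the string labels are assumptions, and the model grants a single access per call.

```python
def select_memory_access(load_store_pending, preload_pending, save_pending):
    """Pick the next data memory access according to the fixed priority order.

    Load/store cycles of the active flow win over preload cycles, which in
    turn win over save cycles drained from the write buffer.
    """
    if load_store_pending:
        return "load_store"
    if preload_pending:
        return "preload"
    if save_pending:
        return "save"
    return None  # no request this cycle


# A save is granted only when neither of the other two request types is pending.
granted = select_memory_access(False, False, True)
```

Placing saves last means that the write queue drains in otherwise idle memory cycles, which fits the stated goal of minimizing interference with the main program flow.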

[0389] The Program Sequencer Unit (PSU) 378

[0390] The PSU 378 performs the instruction decoding and generates the controls for the other core units. The PSU 378 controls the program flow including all scenarios involving the change of flow.

[0391] Fetch Unit (FTU) 380

[0392] The FTU 380 is responsible for controlling the program counter (PC) for instruction fetch operations. According to one embodiment of the invention, the PC may be derived from one of the following sources: sequential increment; jump to an absolute address; jump to an address specified by a register; task switch to a next task entry point; relative change of flow; exception control (e.g., reset, breakpoint, patch, etc.); and return from trap.

[0393] Messaging Interface (Agent Interface) 382

[0394] A few instructions are executed in an external module (e.g., DMA, accelerators, etc.) connected to the network processor core. A messaging bus (Agent Interface or AGI) from the core to the external module enables the definition and support of such an extension of the instruction set.

[0395] Memory Interface 384

[0396] According to one aspect of the invention, the network processor core uses a unified memory space wherein each address can contain either program information or data. This memory space is typically based on on-chip RAM and ROM. The memory module should have separate ports for program, data and context accesses. Also, this memory module may have additional ports for accesses from the external world, such as the ring interface.

[0397] A Programming Model for a Flexible Packet Processor

[0398] The programming model describes the rules for writing network processor programs. After a brief introduction that explains in general terms the organization of the network processor code and the flow of data through the system, the programming model (e.g., state resources, interfaces and instruction groups) is outlined in high level terms. Then, the execution flow and performance issues are discussed. Last, the programming model is detailed.

[0399] Organization of the Network Processor Code

[0400] According to one embodiment of the invention, the network processor comprises a 32-bit single issue RISC processor tailored for real-time communication processing goals. According to an embodiment, the network processor has 32 general purpose registers, built-in support for multi-tasking, communication peripherals, on-chip SRAM, a DMA interface to external SDRAM, and a built-in interface to an on-chip control processor (referred to as the host processor or the Packet processor [PP] or the Control Packet processor [CPP]).

[0401] It is desirable that the network processor have hardware support for up to 62 tasks. The hardware support includes generation of task activation triggers, automatic task scheduling, save and restore of registers to and from the shadow register area in internal SRAM, special instructions for yielding the CPU, and support for passing messages between tasks.

[0402] Each network processor task has a dedicated register set. The task registers are preserved across the periods in which the task is not running. A network processor task can access internal memory with load and store instructions, and can copy data from internal to external memory and vice-versa using special DMA instructions.

[0403] The data which a task operates upon can be classified into the following categories (with reference to FIG. 39):

[0404] Data from the communication peripherals (arrow 402):

[0405] This data is copied, using a special instruction from the peripheral's FIFO, into internal memory (arrow 406). On the transmit side, this data is copied, using a special instruction, from internal memory into the peripheral FIFO. This type of data, which is in transit through the device, can be referred to as stream data.

Stream data exchanged with the host processor (arrow 408): This data is passed by a network processor task, usually in external memory, for further processing to the host processor. On the transmit side, the host processor passes this data to a network processor task for transmit-related tasks (such as encapsulation, shaping, scheduling, and so forth) and for transmission through a peripheral. Stream data is also handed over between network processor tasks. There are cases when the stream data is not touched by the host processor.

[0406] Configuration data: This data resides in internal memory and is set at initialization time by the host processor or by initialization procedures on the network processor (e.g., buffer size). Configuration data is consumed, but not produced, by the task.

[0407] Flow state data: This data is kept in internal or external memory, and describes, for example, the state of each ATM connection or the state of the current Ethernet frame. Part of this data is used and updated by the task (e.g., the cell count for a connection).

[0408] Task state data: This data is kept in internal memory (or registers), and is used by the task to keep information in case the task does not complete the work intended to be accomplished during a single period of possession of the CPU.

[0409] A High Level View of the Programming Model

[0410] According to an embodiment of the invention, the programming model for the flexible packet processor includes the following elements: state resources, the hardware memory entities which hold the state of the program; interfaces, the description of the ways in which the program should behave to interact with hardware resources which are external to the processor; and the instruction set, the description of the basic tools with which the program performs its operations.

[0411] State Resources

[0412] FIG. 40 provides an overview 420 of the state resources for the network processor according to an embodiment of the invention.

[0413] Interfaces

[0414] DMA interface. The DMA interface controls the DMA machines, which copy data from the NP SRAM to external DRAM and vice versa. The DMA interface is set up by the PP at initialization time, and accepts action commands from the NP via special instructions. The DMA interface connects to the doorbells and the task scheduling mechanism.

[0415] Peripheral FIFO interface. The peripheral FIFOs are set up by the PP at initialization time, and are instructed by special NP instructions to copy a data unit to internal memory (from internal memory in the case of a TX). The peripheral FIFOs are connected to the doorbells and the task scheduling mechanism.

[0416] Accelerators/Coprocessors interface. In general, there may be two kinds of accelerators/coprocessors: (1) accelerators/coprocessors that are tightly connected to the network processor core and that are accessed via a special agent instruction (e.g., CRC, multireader, message sender, etc.), which reside within the network processor compound entity; and (2) accelerators/coprocessors that are ring members and can be accessed by any other ring member interposed on the ring (via messages over the ring).

[0417] Host (PP) processor interface. In general, the PP will be able to initialize NP configuration registers, to share data with the NP in internal and external memories, to request services from an NP task, and to receive interrupts and messages from the NP.

[0418] Instruction set. Instructions perform various types of actions, such as the following: arithmetic, logic, and register manipulation (modify data in registers); load/store (move data between SRAM and registers); flow control (changes in the program counter); task management (control of inter-task changes in the program counter); agent interface instructions, namely DMA (move data between the SRAM and the SDRAM), access to serial ports (move data between the SRAM and communication peripherals), and accelerators (specialized communication processing functions such as a CRC calculation on a block of data); and special purpose register moves and activation of coprocessors (move data between GPRs and SPRs).

[0419] Execution Flow and Performance Considerations

[0420] Generally, the CPU executes instructions sequentially until it encounters an instruction which changes the program flow. For example, this instruction can be a conditional or unconditional branch or jump within the task, which checks a condition bit in one of the general purpose condition registers, or an instruction which terminates the current task and starts execution of another task. Instructions which cause a non-incremental change to the program counter take more than one cycle and are optionally followed by a one-instruction delay slot. Other instructions which influence the program flow are: arithmetic and compare instructions which modify the condition code bits, and instructions which modify the task entry point (the address from which the task will resume execution in its next execution round).

[0421] Types and states of tasks. Tasks can be in one of three states: running, pending and dormant. At any given time there is one running task executing on the CPU. When something requests the service of a task, the task becomes pending. Each time the running task voluntarily yields the CPU, the highest priority task is selected from the pending tasks. Tasks for which nothing has requested their service are dormant, and they will not be enabled for execution and will not run. According to one embodiment of the invention, the number of tasks is determined at initialization time and there is no dynamic creation/elimination of tasks.

[0422] Tasks can be classified by the reason (trigger) that causes a task to become enabled for execution. In other words, tasks can be classified by the entity which they serve:

[0423] Peripheral. A task which serves a communication peripheral. Each time the RX peripheral receives a unit of data (e.g., 64 bytes of an Ethernet frame) in its FIFO, or when a TX peripheral has space for a unit of data available in its FIFO, that peripheral sends a service request to its servant task.

[0424] Timer. A timer can be preprogrammed with a period cycle count. Each time the period expires, the timer sends a service request to its servant task.

[0425] Inter-task messages. Data (usually stream data) can be exchanged or handed over between tasks. One approach for this is to send a message (e.g., containing the data pointer) to the other task, accompanied by a service request. Usually, a task serves only one master (the master being the source of service requests). This means that peripherals, timers and inter-task messages can all request service in the same manner.

[0426] There are two more sources which can cause a task to become pending:

[0427] DMA. A task is permitted to yield the CPU during a DMA request (in this way the DMA will work in parallel with the CPU, and the CPU will not be stalled). The task usually wants to resume execution when the DMA action is completed. Upon completion, the DMA will send a service request to the originating task.

[0428] Self-request. There is a limit to the execution period (the time between two sequential task switch events) of tasks. The execution of the current task usually may not be preempted by an external event, so it is the programmer's responsibility to provide for yielding the CPU before reaching the time limit per task. When a task yields the CPU (e.g., to allow another task to execute) before it has completed the intended work, the task can issue the self-request service request before yielding in order to schedule itself for future execution.

[0429] Task Triggers and Task Doorbell Bits

[0430] Task doorbell bits are the place where the service requests are registered. A network processor task can be enabled for execution by several request sources:

Ordinary priority request from a serial module (e.g., a data fragment is ready in the receive FIFO and was copied to a predefined SRAM location, or the transmit FIFO finished the transmission of the previous data fragment).

[0431] High priority request from a serial module (e.g., the RX FIFO over a threshold or the TX FIFO under a threshold).

[0432] Completion of DMA requests.

[0433] Self-request (produced by the software).

[0434] Message from another task (produced by the software and using the same doorbell bit as an ordinary priority request from a serial module).

[0435] Message queue above threshold (produced by the software and using the same doorbell bit as the high priority request from a serial module).

[0436] Timer (uses the same doorbell bit as the ordinary priority request from a serial module).

[0437] According to one aspect, for each doorbell bit there is a mask bit. The exceptions are the first two doorbell bits, which have a common mask bit, and the self-request bit, which cannot be masked. If the mask bit is set, the task will be enabled for execution by the matching request; otherwise, the request is blocked.

[0438] According to one approach, about twelve tasks are expected to serve serial channels (e.g., 6 for receive and 6 for transmit). These tasks will usually be activated by requests from serial channels. The rest of the tasks are expected to be activated by timers, messages from other tasks, or the host (e.g., doorbell bits 1 and 2).

[0439] A task which has more work to do than the maximum allowable latency permits should yield and use the self-request (doorbell bit 5) to be scheduled again (e.g., a timer handler task). Any task can be activated by a completion of a DMA request that the task originated.

[0440] When a task is scheduled for execution, the request and mask bits of the service request that activated the task are cleared. In the case where there are regular and urgent bits, both are cleared.
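
By way of non-limiting illustration, the doorbell and mask behavior described above can be sketched in software as follows. The request-source names, the mask-bit names, and the Doorbell class are hypothetical and chosen only for this sketch; the actual bit positions are not specified here, and only three request kinds (serial regular/urgent, DMA, and self-request) are modeled.

```python
from dataclasses import dataclass

# Hypothetical request-source names; the real bit layout is not given in the text.
REGULAR, URGENT, DMA, SELF = "regular", "urgent", "dma", "self"

# Regular and urgent requests share one mask bit; SELF has no mask bit at all.
MASK_OF = {REGULAR: "serial", URGENT: "serial", DMA: "dma"}


@dataclass
class Doorbell:
    requests: set  # request bits currently set
    masks: set     # names of the mask bits currently set

    def enabling_requests(self):
        """Return the requests that are allowed to enable the task."""
        return {r for r in self.requests
                if r == SELF or MASK_OF[r] in self.masks}

    def is_enabled(self):
        return bool(self.enabling_requests())

    def serve(self, request):
        """Clear the request and mask bits of the request that activated the task."""
        if request in (REGULAR, URGENT):
            self.requests -= {REGULAR, URGENT}   # both levels are cleared together
        else:
            self.requests.discard(request)
        if request != SELF:
            self.masks.discard(MASK_OF[request])
```

In this sketch a request whose mask bit is clear stays registered but does not enable the task, whereas a self-request enables the task regardless of the mask state.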

[0441] Mask Bits and DMA

[0442] Mask bits can be set by software, and, in some cases, they are set automatically by hardware. A mask bit, together with the associated request bit, is cleared by hardware when the request is served by the task (the task becomes running). Mask bits can be set with a special instruction and can optionally be specified in DMA and YIELD instructions. When a task issues a DMA request and this DMA is not the last action in the task, the programmer should set the DMA doorbell mask bit and clear all other mask bits (this task should not return to execution because of any other request, for example a serial request). When the task returns to execution after completion of the DMA, all mask bits will be clear.

[0443] According to one approach, there is a default state of the mask bits for all tasks, with the first bit set and all the others cleared. Another option, the auto set in DMA and YIELD instructions, instructs the hardware upon DMA completion to set the mask bits to the default state. When a task issues its last DMA request, it sets the auto set indication. The last YIELD instruction of a task should also set the mask bits to the default state.

[0444] According to one approach, the network processor DMA is able to serve two external buses (it can be a single DMA machine in some implementations). An immediate DMA ID field is specified in DMA instructions. Its value is an index into a translation table (the table may be programmed by the CPU or by writing to special purpose registers on the network processor). The translation result contains information such as big/little endian, and so forth. When all the DMAs initiated by a task (DMAs for which acknowledgement was requested) are complete, the DMA doorbell request bit is set.

[0445] Using a count field in one of the special purpose registers, it is possible to yield if all DMAs of the task have not been completed. Also, when a DMA instruction is executed, and there is no place in the pending DMA transactions queue, it is possible that the network processor may be stalled.

[0446] Task Priority and Scheduling

[0447] Each time the current task suspends its execution, the hardware scheduler selects from the pending tasks the one with the highest priority, and starts execution of that task. Various approaches could be taken to task scheduling. According to one approach, the algorithm for selecting the next task for execution is as follows. The tasks which participate in the selection of the next task for execution are the tasks for which their corresponding mask bit in the Task Global Mask Register (TGMR) is cleared. Tasks which participate in the selection of the next task and have unmasked requests are divided into four groups and served in the following order:

[0448] 1. Highest priority group: includes urgent requests of task numbers 0-31.

[0449] 2. Second priority group: includes regular requests of task numbers 0-31.

[0450] 3. Third priority group: includes urgent requests of task numbers 32-63.

[0451] 4. Lowest priority group: includes regular requests of task numbers 32-63.

[0452] Within each group, the requests are serviced according to the task number. Lower task number requests are served before higher task number requests. The task resides in the higher priority class, starting from the time the urgent doorbell bit was set, until the time its doorbell mask is set to default by an option of the yield instruction, or until its doorbell mask is explicitly cleared by an instruction. According to one approach, the tasks are in an urgent state as long as the handling of all pending urgent events is not completed (including when the task yields while doing a DMA during such a period).
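
By way of non-limiting illustration, the four-group selection order described above can be modeled in software as follows. The function name, the dictionary representation of pending requests, and the tgmr_masked argument are assumptions made for this sketch only; the hardware scheduler is not implied to be built this way.

```python
def select_next_task(requests, tgmr_masked=frozenset()):
    """Pick the next task to run, or None if no task is eligible.

    requests    -- dict mapping task number (0-63) to "urgent" or "regular",
                   listing the tasks that have an unmasked pending request
    tgmr_masked -- task numbers whose TGMR mask bit is set (these are excluded)
    """
    def rank(task):
        upper_half = 1 if task >= 32 else 0              # tasks 0-31 are served first
        not_urgent = 0 if requests[task] == "urgent" else 1
        return (upper_half, not_urgent, task)            # group first, then task number

    eligible = [t for t in requests if t not in tgmr_masked]
    return min(eligible, key=rank) if eligible else None
```

Under this model a regular request of task 5 is served before an urgent request of task 40, because the task-number range takes precedence over the urgency level.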

[0453] When a task starts execution, the doorbell request bit which caused it to run and the matching mask bit are cleared. The other request bits are not modified. The regular and the urgent request bits are considered to be two levels of the same request and have a common mask bit. They are both cleared when the request is serviced. A task can explicitly raise its priority to urgent, and return its priority to natural (normal priority, unless there is an urgent request pending) by using an agent instruction that writes to the doorbell register. This can be used to increase task priority for the period spent in a critical section or in an urgent code fragment.

[0454] Task Switching Performance

[0455] According to one aspect, instructions that yield the CPU take 2 cycles (they have a delay slot). The other performance issue is the time it takes to restore the registers of the new task. Usually the registers of the next task are pre-loaded during the execution of the current task.

[0456] Inter-task Communication

[0457] Global registers. A global register is a general purpose register that is shared between all network processor tasks, and which can be safely used and modified by each task. (A task has to make sure that it completes the whole sequence, which includes the shared register use/update, needed for the action performed, before yielding the CPU.)

[0458] Inter-task messages. Sending messages between tasks is done using queues. Additional information is provided in the discussion regarding data structures.

[0459] Common program. More than one task can execute the same object code, for example, two tasks that service the reception of two identical serial channels. Also, all tasks can share code in functions.

[0460] Internal and external memory. Sharing information in memory is a matter of convention between the tasks. For complex atomic modifications, it is possible to either have a server task with an exclusive right to access the structure or to use semaphores as described further below. (Complex atomic means that the modification requires a series of external memory accesses, between which the data structure is in an inconsistent, i.e., erroneous, state.) An example of a need for such a modification would be the update of a linked list queue whose descriptor is in external memory. Generally, it is recommended to avoid using such structures when possible.

[0461] Host-Network Processor Communication

[0462] Network Processor task to host messages and interrupts. Described in connection with the discussion on data structures.

[0463] Host to Network Processor task messages. The host is able to post a message to the input message queue of any task. The host also sets the doorbell bit of the target task. The host should not post messages to an input message queue to which a network processor task posts messages.

[0464] According to one approach the network processor, either with a hardware mechanism or a software task, should notify the host when the host message queue changes its position relative to a close-to-full threshold. Using such a threshold will permit a less time-constrained handling of messages on the network processor side and eliminates the need for a check-if-not-full inquiry on the host side.

[0465] Host to Network Processor commands. There is a command register that is written to so that the host can control network processor execution. For example, such commands may include a reset, an activate task N, a deactivate task N (without aborting its current execution), and a start execution of task N (i.e., give task N a request without aborting the currently executing task).

[0466] Host-network processor parameters. According to one approach, for each task an area is allocated at compilation time to hold the parameters that are initialized by the host and used by the task. The addresses of these areas are maintained together with the frame pointers and the entry points, and are loaded by the boot initialization routine (into R6, discussed further below) of each task. These parameters are also read by the host, and are used in the initialization drivers.

[0467] State Resources

[0468] General Purpose Registers

[0469] According to one approach, there are 32 general purpose 32-bit registers to be used by the tasks. Some of the registers, r0-rN, do not preserve their values across task switching; they are common to all tasks. These are referred to as common registers. The other registers, rN+1-r31, are preserved across task switching. These registers are referred to as private registers. According to one embodiment of the invention, these private registers are saved and restored from their shadow location by the hardware, transparently to the programmer. N is a global value, preferably programmed at initialization time. According to one approach, N (which should be odd) is 15, although other values of N may be used depending on design considerations. The programmer should allocate the correct shadow area for the registers, which should be the number of tasks multiplied by the number of private registers. The programmer should use registers contiguously, starting from r31 downwards.
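
By way of non-limiting illustration, the shadow area sizing rule stated above (number of tasks multiplied by number of private registers) can be worked through as follows. The function names are hypothetical, and the size is expressed in 32-bit words on the assumption that each private register occupies one word of shadow memory.

```python
NUM_GPRS = 32  # r0..r31


def private_register_count(n):
    """Registers rN+1..r31 are private; r0..rN are common."""
    if not 0 <= n < NUM_GPRS:
        raise ValueError("N must be between 0 and 31")
    return NUM_GPRS - (n + 1)


def shadow_area_words(num_tasks, n):
    """Shadow area size in 32-bit words: tasks times private registers."""
    return num_tasks * private_register_count(n)
```

For example, with N equal to 15 there are 16 private registers per task, so 62 tasks would call for a shadow area of 992 words under these assumptions.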

[0470] According to one aspect of the invention, some of the registers have special hardware support, as follows:

[0471] r0 is interpreted as constant 0; writes are ignored.

[0472] FIG. 41 illustrates register r1 (430) in greater detail in accordance with at least one embodiment of the present invention.

[0473] r1 condition codes: sticky condition (1 bit); arithmetic conditions (equal/zero [1 bit], less than/negative [1 bit], greater than/positive [1 bit], carry [1 bit], overflow [1 bit]); doorbell bits (6 bits); and user defined condition bits (16 bits).

[0474] r31: user defined condition codes (32 bits).

[0475] r30: entry point address of the task.

[0476] r28: link address 1 (function return address).

[0477] r29: link address 2.

[0478] According to one approach, the convention for register allocation is similar to the approach taken for application binary interfaces, or ABI. ABI is a standard that allows object code interoperability of functions compiled by different compilers or written in different languages. Register allocation according to this approach is as follows:

[0479] r27 and other r2x registers (26>2x>20) are allocated to a fixed meaning. Registers which are allocated to some meaning by convention are expected to maintain the meaning over function calls. They can be modified within functions, but only according to their meaning. Each task might have different registers allocated to fixed meanings.

[0480] r27: parameter area pointer and stack pointer of the task. The compiler or the programmer statically allocates up to three stack frames for each task. The compiler computes the area used by level0 code (first frame), and the maximum area needed for automatic variables of level1 functions of the task (second frame) and of level2 functions of the task (third frame). There is a global limit of memory size of local function variables (enforced by the compiler). Whenever there is an indirect function call, the maximal stack frame will be allocated. All accesses to local variables will be translated by the compiler to offsets on r27, and there is no need for a stack pointer register for dynamically allocating frames on the stack and for modifying the stack pointer during function calls and returns.
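
By way of non-limiting illustration, the static allocation of three frames per task can be sketched as follows. The function name and the contiguous, ascending layout of the frames relative to r27 are assumptions made for this sketch; the parameter area and any alignment requirements are not modeled.

```python
def frame_offsets(level0_size, level1_sizes, level2_sizes):
    """Return (offsets, total) for the three statically allocated frames.

    offsets maps the call level (0, 1, 2) to the r27-relative start of its
    frame. The second and third frames are sized for the largest level1 and
    level2 function of the task, so every function at a given level reuses
    the same frame and no stack pointer adjustment is needed at run time.
    """
    frame1 = max(level1_sizes, default=0)   # largest level1 local-variable area
    frame2 = max(level2_sizes, default=0)   # largest level2 local-variable area
    offsets = {0: 0, 1: level0_size, 2: level0_size + frame1}
    total = level0_size + frame1 + frame2
    return offsets, total
```

A local variable at offset k within a level1 function would then be addressed as r27 plus offsets[1] plus k, a constant that can be resolved at compile time.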

[0481] According to one approach, the compiler limits the function call depth to two. The compiler may also identify those functions which do not yield and do not call other functions, allocate their frame in an area common to all tasks, and use absolute addresses to access local variables (this may save memory per task in this case). Other registers can also be allocated by convention to: data unit address in internal memory, data unit pointer in external memory, connection table base address, and so forth. Registers which are allocated to some meaning by convention are expected to maintain the meaning over function calls. Such registers can be modified within functions, but only according to their meaning.

[0482] r16, r17: These registers do not preserve their value over any function call. They can be used without saving in level2 functions and in level1 functions which do not expect the value to be preserved over a level2 function call. The r16 and r17 registers are used to pass parameters and get results to/from level1 and level2 functions. Even in the case when there are no parameters passed, these registers do not preserve their value over any function call. Preferably, the compiler forbids functions of more than two parameters.

[0483] The compiler and the assembly programmer may use the r16, r17 order for level1 functions and the r17, r16 order for level2 functions. This may eliminate saving and restoring of r16 when both level1 and level2 functions have a single parameter. Also, r16 and r17 are the only private registers which can be modified in level2 functions.

[0484] r18-r19: These registers should not be modified within level2 functions. They can be used without saving in level1 functions, and they do not preserve their value over level1 function calls.

[0485] r20-r26: These registers should not be modified within level1 and level2 functions. These registers can be used without saving in level0 code. Some of these registers can be assigned to a fixed meaning, in which case they can be modified within functions according to their fixed meaning.

[0486] r0-r15 are scratch or global registers that are common to all the tasks, and which are not changed by the hardware task switching.

[0487] r2-r5 hold information that is frequently used and shared between tasks, such as the buffer array base address (r2) and the free buffer pool address (current) (r3). These registers can hold popular (often used) constants, such as a table base address or an arithmetic constant.

[0488] r8-r15 are used to hold information which does not need to be preserved across yields, such as intermediate results of an arithmetic computation.

[0489] r6-r11 do not preserve their value over function calls.

[0490] r12, r13: These registers preserve their values over calls to level2 functions which do not yield.

[0491] r14, r15: These registers preserve their values across calls to level1 and level2 functions which do not yield.

[0492] Table 3 summarizes the register conventions discussed above.

TABLE 3

Register   Private or common   Special HW handling      Fixed meaning   Modified by functions   Used as parameter
r0         Common              constant 0               NA              Yes                     No
r1         Common              conditions               No              Yes                     No
r2-r5      Common              No                       Part            within fixed meaning    No
r6-r11     Common              No                       No              level 1 & 2 & yield     No
r12, r13   Common              No                       No              level 1 & yield         No
r14, r15   Common              No                       No              No                      No
r16, r17   Private             No                       No              level 1 & 2             Yes
r18, r19   Private             No                       No              level 1                 No
r20-r27    Private             No                       Part            No                      No
r28        Private             level 1 return address   No              No                      No
r29        Private             level 2 return address   No              level 2                 No
r30        Private             entry point              NA              Yes (TBD)               No
r31        Private             conditions               No              Yes (TBD)               No

[0493] By way of summary, registers can be safely used in the following cases:

[0494] r8-r9: level2 function code which does not contain a yield; level1 function code which does not contain a yield or a call to a level2 function; and level0 code which does not contain a yield or a function call.

[0495] r10-r11: level1 function code which does not contain a yield or a call to a level2 function which yields.

[0496] r12-r15: level0 code which does not contain a yield or a call to a function which yields.

[0497] r16, r17: any level2 function code; level0/1 function code which does not contain a function call.

[0498] r18, r19: any level1 function code; level0 code which does not contain a function call.

[0499] r20-r2X: any level0 code.

[0500] Indication Registers

[0501] According to one approach, registers r1 and r31 contain indications which can be used in branch conditional instructions. They can be explicitly updated by any instruction, but some of the bits in r1 are implicitly updated by compare instructions and by arithmetic/load instructions. The carry bit is also implicitly updated by some arithmetic instructions.

[0502] R1 is a global register; its value is not preserved after task switching. R31 has a copy per task.

[0503] The doorbell and mask fields in r1. The doorbell sub-field contains a copy of the doorbell bits of the current task. The mask bits are a copy of the task's mask bits. Writes to these fields are ignored.

[0504] Compare instructions, the sticky bit options. Compare instructions modify the three condition code bits, LT, EQ, and GT. Optionally, the compare instructions can also update the sticky bit. These instructions specify a condition, such as one of NONE, LT (less than), LE (less than or equal to), EQ (equal to), NE (not equal), GT (greater than), or GE (greater than or equal to). If the condition is satisfied by the compare, the sticky bit is set; otherwise, the sticky bit is not altered. This feature is useful to efficiently implement several tests of error cases as well as other AND/OR conditions. Compare instructions also have an option to overwrite the sticky bit. FIGS. 87-90 (discussed below) illustrate various mechanisms for using the accumulative condition flag, i.e., the sticky bit, to execute branch instructions in processing systems, such as a network processor or communications processor.
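
By way of non-limiting illustration, the accumulative behavior of the sticky bit can be modeled as follows. The compare function and its argument names are hypothetical. The sketch also assumes that the overwrite option makes the sticky bit follow the outcome of the current compare, which is one plausible reading of the description rather than a confirmed definition.

```python
import operator

CONDITIONS = {
    "LT": operator.lt, "LE": operator.le, "EQ": operator.eq,
    "NE": operator.ne, "GT": operator.gt, "GE": operator.ge,
}


def compare(a, b, sticky, cond="NONE", overwrite=False):
    """Model one compare instruction; return (lt, eq, gt, sticky)."""
    lt, eq, gt = a < b, a == b, a > b
    if cond != "NONE":
        satisfied = CONDITIONS[cond](a, b)
        if overwrite:
            sticky = satisfied        # assumed: overwrite replaces the old value
        elif satisfied:
            sticky = True             # accumulate: set when satisfied, never cleared
    return lt, eq, gt, sticky
```

A sequence of error checks can thus start with one overwriting compare and continue with accumulating compares, after which a single conditional branch on the sticky bit covers the logical OR of all the checks.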

[0505] Serial status. The serial status indications (e.g., error, over-run/under-run, and last), optionally together with the data fragment size, should be loaded by the programmer from a fixed memory location into r1 or r31.

[0506] User defined indications. The user can keep state information in the user-defined part of r1 or r31. It may be desirable for an indication to be created once and used several times. The user can also load to r1 or r31 a part of an array of indications.

[0507] Arithmetic instructions modify the condition codes. Arithmetic instructions can modify the zero, negative, and positive condition code bits. The following arithmetic instructions modify the carry condition code bit: ADD, SUB, ADDI, SUBI, SRR, SLR, SLI, SRI, and CLB.

[0508] Branch, jump and yield conditional. Conditional branch/jump and yield instructions test a single condition bit, which can be any bit in r1 or r31, and compare that bit to either 0 or 1. Conditional branch/jump instructions take three cycles when taken and 1-2 cycles when not taken, while unconditional branch/jump instructions take two cycles; in both cases they have an optional delay slot.

Conditional instructions. In most of the instructions the 3-bit conditional execution field is used to specify whether the instruction is unconditional or is conditional upon the sticky condition bit being true or false. One of the three bits is reserved for future use.

[0509] Link Registers

[0510] Branch/jump instructions can be used to call subroutines. They have an opcode bit which specifies whether the return address is to be saved, and another opcode bit which specifies whether the return address should be saved in r28 or r29. The return address is either PC+1 or PC+2, depending on whether the delayed branch option is used. The function call depth is limited to two, and the depth of each call/return is specified in the instruction. Functions which do not call other functions should be defined and called as depth 2.
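
By way of non-limiting illustration, the selection of the link register and of the return address can be sketched as follows. The call function and the dictionary used as a register file are hypothetical; the mapping of depth 1 to r28 and depth 2 to r29 follows the register conventions summarized in Table 3.

```python
LINK_REGISTER = {1: "r28", 2: "r29"}   # level 1 and level 2 return addresses


def call(pc, depth, delay_slot, regs):
    """Record the return address for a call instruction located at address pc."""
    if depth not in LINK_REGISTER:
        raise ValueError("function call depth is limited to two")
    # With the delayed branch option, the instruction at pc+1 executes in the
    # delay slot, so execution resumes at pc+2 after the function returns.
    regs[LINK_REGISTER[depth]] = pc + 2 if delay_slot else pc + 1
    return regs
```

Because the two depths use different link registers, a level1 function can call a level2 function without first saving its own return address.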

[0511] The Task's Entry Point Register

[0512] R30 contains the address at which the task will resume execution after a yield. It is modified by any instruction which modifies r30 and is optionally modified by the YIELD instruction. It can optionally be modified by DMA instructions which yield.

[0513] Hidden Registers

[0514] Program counter. According to one approach, there is a single program counter in the system (not per-task) and it is not directly accessible by the software in any manner.

[0515] Special Purpose Registers

[0516] Special Purpose Registers (SPRs) are network processor core registers that are not defined as one of the General Purpose Registers (GPRs). Special instructions (SPRL and SPRS) are defined to enable the movement of data between SPRs and GPRs. Special Purpose Registers in the network processor include the Refetch SPR 440, the Task SPR 442, the Trap SPR 444, and the Mindex SPR 446, as shown in FIG. 42.

[0517] Refetch SPR 440. The refetch SPR is a 32-bit register that holds the first and second program memory addresses of the instructions to be refetched when getting out of a trap. Bits 15:0 hold the first instruction address (called refetch) and bits 31:16 hold the second instruction address (called next_refetch). When the network processor receives a break request and is not already in the trap mode, it continues instruction execution from the program location pointed to by the break vector and the trap mode bit is set (in the trap SPR). The address of the instruction that would have been executed but for the occurrence of the breakpoint is saved in bits 15:0 of the refetch SPR. The address of the following instruction that was supposed to be executed but for the occurrence of the breakpoint is saved in bits 31:16 of the refetch SPR.

[0518] Leaving the trap mode is performed by executing the RFT instruction. This instruction causes a program jump to the program location specified by the refetch SPR bits 15:0, followed by the program location specified by the refetch SPR bits 31:16. This also clears the trap mode bit.

[0519] The refetch SPR is a read/write register that can be accessed through the SPRL and SPRS instructions.
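
By way of non-limiting illustration, the packing of the two 16-bit program addresses into the 32-bit refetch SPR can be expressed as follows. The function names are hypothetical; the field positions (bits 15:0 for refetch and bits 31:16 for next_refetch) are those given above.

```python
def pack_refetch(refetch, next_refetch):
    """Build the 32-bit refetch SPR value from two 16-bit program addresses."""
    if not (0 <= refetch <= 0xFFFF and 0 <= next_refetch <= 0xFFFF):
        raise ValueError("addresses must fit in 16 bits")
    return (next_refetch << 16) | refetch


def unpack_refetch(spr):
    """Return (refetch, next_refetch) from a refetch SPR value."""
    return spr & 0xFFFF, (spr >> 16) & 0xFFFF
```

Keeping two addresses rather than one allows the return from trap to resume correctly when the interrupted instruction pair does not lie at consecutive addresses, for example when the first instruction sits in a branch delay slot.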

[0520] Task SPR 442. The task SPR is a 32-bit read only register. The task SPR contains information on the current executing task and on the next task to be executed:

[0521] DOORBELL REQ reflects the doorbell request bits of the current task.

[0522] CTID reflects the Current Task ID.

[0523] NTID reflects the Next Task ID.

[0524] NTV reflects Next Task Valid bit.

[0525] MASK reflects the doorbell mask bits of the current task.

[0526] UR reflects the urgency level of the task (1=urgent).

[0527] COUNT reflects the doorbell counter value of the current task.

[0528] When there is a yield and both the bump buffer is empty and the context of the next task is already pre-loaded, the network processor switches to the next task. At this point the NTID is loaded into the CTID and the next task ID together with the next task valid bit from the doorbell are sampled into the NTID and into the NTV, respectively.

[0529] If the NTV bit is set, then the NTID is locked and there will not be further sampling. If the NTV bit is cleared, then the doorbell next task ID will continue to be sampled on each cycle until the valid bit is set.

[0530] The new valid next task ID is used by the pre-load logic to pre-load the next task's context. The task SPR can be read by using the SPRL instruction. All other bits of the task SPR are reserved and will be read as zero. The CTID, NTID and NTV bits are cleared by reset. The default state (and the reset state) of the mask of each task is 0b100.

[0531] Trap SPR 444. The trap SPR is a 32-bit register. The trap SPR includes the trap mode bit, the illegal instruction status bit, and the breakpoint status bits:

[0532] Bit 0 - Illegal Instruction (IL): When there is an illegal instruction, the IL bit is set. The IL bit can be cleared only by reset.

[0533] Bit 1 - Trap Mode (TRAP): When the TRAP bit is set, the network processor is in the trap mode. A breakpoint event causes the program flow to jump to a program location (pointed to by a given vector) and to enter the trap mode of execution by setting the trap mode bit. When in trap mode, no breakpoint and/or patch events will be accepted. The trap mode bit will be cleared by an RFT (Return From Trap) instruction or by writing zero to the trap mode bit. When the trap bit is cleared, further breakpoints and/or patches will be accepted.

[0534] Bit 2 - Program Address Break (PAB): This is a breakpoint status bit, which when set, indicates that a program address breakpoint occurred. This bit is cleared by an RFT instruction or by writing zero to it.

[0535] Bit 3 - Data Address Break (DAB): This is a breakpoint status bit, which when set, indicates that a data address breakpoint occurred. This bit is cleared by an RFT instruction or by writing zero to it.

[0536] Bit 4 - Task Break (TB): This is a breakpoint status bit, which when set, indicates that a task ID breakpoint occurred. This bit is cleared by an RFT instruction or by writing zero to it.

[0537] Bit 5 - Yield Break (YB): This is a breakpoint status bit, which when set, indicates that a yield breakpoint occurred. This bit is cleared by an RFT instruction or by writing zero to it.
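As an illustrative aid, the trap SPR bit assignments listed above can be expressed as a small decoder. The Python sketch below uses the bit positions and clearing rules given in the text; the function names `decode_trap_spr` and `after_rft` are hypothetical and are not part of the described instruction set.

```python
# Bit positions of the trap SPR flags, as listed in the text.
TRAP_SPR_BITS = {"IL": 0, "TRAP": 1, "PAB": 2, "DAB": 3, "TB": 4, "YB": 5}


def decode_trap_spr(value):
    """Return the names of the flags that are set in a trap SPR value."""
    return [name for name, bit in TRAP_SPR_BITS.items() if (value >> bit) & 1]


def after_rft(value):
    """Model the effect of an RFT instruction on the trap SPR.

    RFT clears TRAP and the four breakpoint status bits. IL is left
    untouched because the text states it is cleared only by reset.
    """
    cleared = ("TRAP", "PAB", "DAB", "TB", "YB")
    clear_mask = sum(1 << TRAP_SPR_BITS[name] for name in cleared)
    return value & ~clear_mask & 0xFFFFFFFF
```

For example, a value with bits 1 and 2 set decodes as trap mode entered because of a program address breakpoint.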

[0538] Semaphores

[0539] Semaphores are commonly used when a section of code that contains yields should not be executed by more than one task at a time. This happens when the code is handling some data structure resource that is shared between tasks. Current examples which might entail the use of semaphores are: adding to and removing from a linked list queue whose descriptor is in external memory; releasing a multicast buffer (update of the reference count); emulation of a task's message queue in external memory; and allowing a task that tries to put an inter-task message into a full message queue to use the hardware mechanism to wait until the queue is not full.

[0540] The alternative solution of not yielding while in the critical section is not efficient. The alternative solution of having a dedicated task responsible for the resource, and thus serializing the actions performed on the resource, is in some cases complicated to implement and in some cases inefficient.

[0541] Network processor software semaphores in accordance with the present invention are implemented over a hardware mechanism which makes it possible to prevent the scheduling of tasks specified in a bitmap (the TGMR register).

[0542] The number of semaphores is limited only by the size of the memory space allocated for semaphore support. Every semaphore requires a one-byte indication of free/busy state plus a 64-bit mask of tasks registered for the particular semaphore. While performing the critical section protected by a semaphore, the task's priority should be raised and all issued DMAs should be treated as urgent in order to minimize semaphore holding time.

[0543] There cannot be too many semaphores in the system (e.g., in order to comply with the goal of keeping the internal memory requirement reasonable), yet there are many shared external memory resources (data queues, contexts, lookup tables, etc.) that may require semaphore protection. According to one approach, the semaphore ID (number) is chosen based on a simple arithmetic operation (e.g., a MOD of significant bits) on the resource address.
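The mapping of a resource address to a semaphore ID by a MOD operation can be sketched as follows. This Python fragment is an example only: the function name, the number of semaphores (16) and the number of low-order address bits discarded (6) are all assumptions chosen for illustration, since the text does not specify which bits are considered significant.

```python
def semaphore_id(resource_address, num_semaphores=16, low_bits_dropped=6):
    """Map a resource address to a semaphore ID.

    The low-order bits are dropped so that neighbouring bytes of one
    resource share a semaphore, then a MOD selects the semaphore.
    Both default parameters are hypothetical.
    """
    if num_semaphores <= 0:
        raise ValueError("num_semaphores must be positive")
    return (resource_address >> low_bits_dropped) % num_semaphores
```

Two unrelated resources may map to the same semaphore ID under such a scheme. This costs some concurrency but not correctness, which is consistent with keeping the number of semaphores small.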

[0544] The network processor scheduler hardware includes a bitmap in an SPR register (SPR bitmap). Each bit in the bitmap, when set, prevents the scheduling of the task whose ID corresponds to the bit index. The network processor software can add to, or remove from, the SPR bitmap a list of tasks specified in a software bitmap. The software registers in the SPR bitmap those tasks which are prevented from execution because they are waiting for one of the currently occupied semaphores (see bad_list below).

[0545] The software holds an indication in internal memory for each semaphore that indicates whether that semaphore is currently in use/occupied (see semX_indic below). The software also holds for each semaphore a 64-bit bitmap corresponding to the tasks that are currently awaiting access to the semaphore (see semX_mask below). For each task awaiting the semaphore, the bit corresponding to that task's ID is set.

[0546] According to one embodiment (not reflected in the table below), the software also holds the task ID of each task in the form of a 64-bit mask (where only the bit corresponding to the task ID is set in this mask).

[0547] The following pseudocode in Table 4 illustrates the use of a semaphore:

TABLE 4
Pseudocode Illustrating the Use of a Semaphore

bad_list - hardware 64-bit mask indicating which tasks cannot be run.
semX_indic - software indication per each semaphore (X) that indicates whether it is occupied.
semX_mask - software 64-bit mask per each semaphore (X) that comprises registration of the waiting tasks.

produce X (semId) from the resource address
checkX:                                ; This is the frequently used code fragment - efficiency is vital.
  ld.b r2,semX_indic                   ; Load the “semaphore is busy” indication - a byte or a bit,
  bc.neq sem_occupied                  ; and test it.
                                       ; Do the critical section code and release the semaphore.
  sti 0xff,semX_indic                  ; If it was not occupied, grab it and do the critical section.
  seturg on
  CRITICAL SECTION X
  seturg off
  sti 0,semX_indic                     ; Release the semaphore.
  clear semX_mask bits in bad_list     ; agentw. Let all in, highest priority task will be selected.
  . . .                                ; Rest of the task code and yield.
sem_occupied:                          ; Register myself on the semaphore, and prevent myself from running.
  ld.d r2,r3,semX_mask                 ; Get the 64-bit mask of tasks waiting for this semaphore.
  set bit of current task in r2,r3     ; “Optimization”: the current task_id is prepared in a doubleword mask in the init routine.
  st.d r2,r3,semX_mask                 ; Save the mask for common use.
  set semX_mask bits in bad_list       ; agentw. Prevent everyone (and myself) who is waiting to semX from being scheduled in.
  set my task's doorbell bit           ; Re-activate my request.
  yield.ep sem_released                ; Go to sleep until it is my turn to use the semaphore.
sem_released:                          ; The semaphore was held by someone, but now it might be free.
  ld.d r2,r3,semX_mask                 ;
  clear bit of current task in r2,r3   ;
  st.d r2,r3,semX_mask                 ;
  set semX_mask bits in bad_list       ; agentw. Prevent everyone else who is waiting to semX from being scheduled in.
  b checkX                             ; Re-check the lock - avoids nasty bugs.

[0549] The general operation of the use of semaphores is as follows. Whenever a task seeks to enter critical section number X, the task checks the internal memory indication of semaphore X to determine if there is currently any other task in the critical section.

[0550] If the semaphore indication is clear, the task sets the indication and enters the critical section. After completion of the critical section (e.g., which contains external memory accesses and task switches), the task clears the semaphore indication. It is possible that while the task was in the critical section other tasks may have registered themselves as awaiting access to the semaphore and prevented themselves from being scheduled in by the hardware scheduler. So the current task will enable these other tasks, which are registered as awaiting scheduling for the semaphore, by removing their list from the hardware bitmap.

[0551] If the semaphore is set, the task branches to sem_occupied, registers itself in the list of tasks awaiting the semaphore, and disables those tasks by adding the list to the hardware bitmap. Task switching is then initiated after setting execution to resume at the sem_released label. When the task resumes execution, the task deregisters itself from the list of tasks that are awaiting the semaphore, and prevents other tasks on the list from being scheduled by adding them to the hardware bitmap. The task then executes the code which checks the semaphore indication.
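The general operation described in paragraphs [0549] to [0551] can be illustrated with a short behavioral model. The Python sketch below is a simplification under stated assumptions: it models a single semaphore, uses plain integers for the 64-bit bitmaps, and leaves out the scheduler, doorbells and urgency handling. The class and method names are hypothetical; the attributes `bad_list`, `indic` and `mask` stand in for bad_list, semX_indic and semX_mask.

```python
class SemaphoreModel:
    """Single-semaphore model of the bitmap-based scheme."""

    def __init__(self):
        self.bad_list = 0  # stands in for the hardware bitmap of blocked tasks
        self.indic = 0     # semX_indic: nonzero while the semaphore is occupied
        self.mask = 0      # semX_mask: tasks registered as waiting

    def check(self, task_id):
        """Try to take the semaphore. Returns True if it was acquired."""
        if self.indic == 0:
            self.indic = 0xFF
            return True
        # Occupied: register the task and block every registered waiter.
        self.mask |= 1 << task_id
        self.bad_list |= self.mask
        return False

    def release(self):
        """Leave the critical section and let all waiters be scheduled."""
        self.indic = 0
        self.bad_list &= ~self.mask

    def resume(self, task_id):
        """Run when a waiting task is scheduled in again.

        The task deregisters itself, blocks the remaining waiters,
        then re-checks the semaphore indication.
        """
        self.mask &= ~(1 << task_id)
        self.bad_list |= self.mask
        return self.check(task_id)
```

With tasks 1, 2 and 3, task 1 acquires the semaphore, tasks 2 and 3 register and are blocked, and after `release()` whichever waiter runs first re-blocks the other in `resume()` before entering the critical section.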

[0552] In accordance with one embodiment of the present invention, a method of employing semaphores to limit access to a shared resource used by a multi-tasking processor is provided. The method comprises the steps of providing a first bitmap in a register that prevents specified tasks from running because the specified tasks are awaiting access to an occupied semaphore, storing an indication in memory that indicates whether the semaphore is occupied, storing a second bitmap in memory that identifies tasks that are awaiting access to the semaphore, and attempting to access the semaphore based on checking the indication in memory. When a task checking the indication in memory determines that the semaphore is available, the method can further comprise the steps of setting the indication to indicate that the semaphore is occupied and performing the processing for the task, wherein performing the processing for the task includes critical section execution. The critical section can include at least one of external memory accesses and task switches.

[0553] The method can further comprise the step of resetting the indication to indicate that the semaphore is available after the step of performing the processing for the task. Furthermore, the method additionally can comprise the step of removing from the first bitmap those tasks now included in the second bitmap in memory that identifies tasks that are awaiting access to the semaphore, thereby allowing those tasks to be scheduled for access to the semaphore.

[0554] In one embodiment, when a task checking the indication in memory determines that the semaphore is occupied, the method can further comprise the steps of including the task in the second bitmap and revising the first bitmap to reflect the tasks from the list in the second bitmap. The method further can include the steps of removing the task from the second bitmap when the indication reflects that the semaphore is available and revising the first bitmap to reflect the tasks from the list in the second bitmap, thereby allowing the task to access the semaphore and perform the task processing.

[0555] In accordance with another embodiment of the present invention, a system employing semaphores to limit access to a shared resource used by a multi-tasking processor is provided. The system comprises a first bitmap in a register that prevents specified tasks from running because the specified tasks are awaiting access to an occupied semaphore, an indication in memory that indicates whether the semaphore is occupied, a second bitmap in memory that identifies tasks that are awaiting access to the semaphore, and means for attempting to access the semaphore based on checking the indication in memory. The means for attempting can be a processor executing a task, wherein the task can be enabled to access the semaphore when the indication reflects that the semaphore is available. Also, the task can be enabled to register itself with the second bitmap and update the first bitmap when the indication reflects that the semaphore is occupied. The task execution can include processing a critical section including at least one of external memory accesses and task switching, wherein the indication in memory is reset to indicate that the semaphore is available after processing the critical section.

[0556] The Software Data Model

[0557] Referring now to FIG. 43, an exemplary software data model 450 is illustrated in accordance with at least one embodiment of the present invention. There are two major types of data allocated in internal memory: global data and task/function data.

[0558] Global Data:

[0559] .adata start

[0560] global data definitions, examples:

[0561] long generic_taskmessage_q[8]

[0562] .struct_structure_name instance_name;

[0563] .adata end

[0564] Global data has a global name scope and can be symbolically referenced from anywhere in the code. References are translated to absolute addressing.

[0565] Task/Function Data.

[0566] .task [common] task_type_name

[0567] task data definitions and task code.

[0568] .task end [task_type_name]

[0569] .func level1/2 function_name

[0570] function data definitions and function code.

[0571] .func end [function_name]

[0572] Local data definitions have a local name scope (detailed below) and references are translated by the assembler to r27+immediate offset. Functions can be defined either within a task definition or outside of any task definition. Function names which are defined outside of any task definition have global name scope and can be called from any place in the code. They can access their local data and the global data. Function names which are defined within a task definition have a scope of the task definition. They can be called only by level0 code of that task type. They can access the common data of the task (detailed below).

[0573] There is hardware support for keeping return addresses for two levels of nesting of function calls. A static stack frame will be maintained, made of three parts, for each task instance. This should solve the problem of allocation of the correct size of dynamic stacks. It will also make function calls more efficient by eliminating handling of the stack pointer and of the return address. This means that at definition time the level (1 or 2) of each function is specified. Functions which do not call other functions will be defined as level2 functions.

[0574] For each task type, the assembler creates two data sections, level0 data and level1 data. Their sizes will be used by the PP software to allocate memory for the static frame of each task instance of this task type, and to initialize r27 of the task instance. A task definition can appear several times for the same task type. Such a definition shall be referred to as a task fragment. The data definitions in each of the fragments are in union with the data definitions in each of the other fragments (they overlap and occupy the same memory location).

[0575] During a task fragment definition, an optional common keyword can be used, in which case the data definitions will overlap with any other data definitions, and the scope of the data names will be all the fragments of the same task type.

[0576] The non-common fragments of a task can be used to implement the different functions (referred to as handlers) which the generic task performs. The pointer to the handler is passed in the inter-task message. All the handlers will return to a label in the common part of the task. The common part of the task will only handle the input message queue and dispatch to the handlers.

[0577] The size of the level0 frame for a task type is the size of the data definitions in the common part plus the maximum of the sizes of the data definitions in non-common fragments of the task type.

[0578] Level1 functions can be called only explicitly (i.e., they cannot be called using a pointer). The assembler will find all the calls to level1 functions and will compute the level1 frame size for this task type as the maximum of the sizes of the data definitions of level1 functions called by this task type.

[0579] Level2 functions can be called via a pointer. The assembler will check that the data allocated in each level2 function is not more than a system level constant (80 bytes) and will add this constant to the offsets of data definitions of level1 functions.
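The frame-size rules of paragraphs [0577] to [0579] amount to simple arithmetic, which the following Python sketch makes concrete. The function names are hypothetical, and treating the total static frame as the sum of the three parts is an assumption made for the example rather than a statement about the assembler's actual layout.

```python
LEVEL2_FRAME_BYTES = 80  # system level constant quoted in the text


def level0_frame_size(common_size, fragment_sizes):
    """Common data plus the largest non-common fragment (fragments overlap)."""
    return common_size + max(fragment_sizes, default=0)


def level1_frame_size(called_level1_sizes):
    """Largest data area among the level1 functions called by the task type."""
    return max(called_level1_sizes, default=0)


def level2_data_fits(level2_size):
    """Check a level2 function's data against the system constant."""
    return level2_size <= LEVEL2_FRAME_BYTES


def static_frame_size(common_size, fragment_sizes, called_level1_sizes):
    """Assumed total: level0 part + level1 part + fixed level2 part."""
    return (level0_frame_size(common_size, fragment_sizes)
            + level1_frame_size(called_level1_sizes)
            + LEVEL2_FRAME_BYTES)
```

For instance, a task type with 16 bytes of common data, fragments of 8, 24 and 12 bytes, and calls to level1 functions using 32 and 20 bytes would need 40 bytes for level0 and 32 bytes for level1 under these rules.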

[0580] Scope of labels: local in functions and task fragments; global to all fragments of that task type when in the common task fragment. Labels in task fragments and level2 function names can be passed to the PP software (flow manager) in the object file using the directive: .export label_name.

[0581] According to one approach, the assembler will produce a single code section, which will contain the code of all the tasks and functions. Other function types might be considered, such as ones which do not have local data in memory or which receive as a parameter a pointer to a scratchpad area for their use. Also to be considered is code which is not associated with tasks and functions. (All the labels in this code will have global scope. It might be used for additional types of functions.) In cases when the caller's frame is no longer needed (an error condition, for example), it might call a function of the same level, which will use the caller's frame.

[0582] The Instruction Set

[0583] Addressing modes:

[0584] Instruction addressing: All instruction addresses are word addresses; they are shifted left 2 bits to generate the memory address.

[0585] Absolute: Jump to the absolute address specified in the 16-bit immediate instruction field.

[0586] PC relative: Branch to an offset from the current program counter specified in the 12-bit immediate signed instruction field.

[0587] Register: Jump to the address contained in the register specified in the instruction.

[0588] Implicit task entry point: During task switch, jump to the entry point of the next enabled task (in r30 of that task).

[0589] Data addressing: Data addresses are byte addresses that are taken as is, regardless of the access size.

[0590] Register with offset: The address is the sum of the value contained in the register and the sign-extended 8-bit immediate instruction field.

[0591] Register with index register: The address is the sum of the value contained in the register and the value contained in the index register.
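The address arithmetic implied by these addressing modes can be sketched in a few lines of Python. The helper names are hypothetical. The sketch also assumes that the PC-relative offset is counted in instruction words and applied to the current PC value itself, and that data addresses wrap at 32 bits; none of these details is stated in the text.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def instruction_byte_address(word_address):
    """Instruction word addresses are shifted left 2 bits for memory."""
    return word_address << 2


def pc_relative_target(pc_word, imm12):
    """Branch target (word address) from a signed 12-bit immediate."""
    return pc_word + sign_extend(imm12, 12)


def register_offset_address(reg_value, imm8):
    """Data byte address: register plus sign-extended 8-bit immediate."""
    return (reg_value + sign_extend(imm8, 8)) & 0xFFFFFFFF


def register_index_address(reg_value, index_value):
    """Data byte address: register plus index register."""
    return (reg_value + index_value) & 0xFFFFFFFF
```

Under these assumptions a 12-bit immediate of 0xFFE branches two instruction words backward, and an 8-bit immediate of 0xFC addresses four bytes below the register value.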

[0592] Instruction Groups

[0593] According to one embodiment of the packet processor of the present invention, the following instruction groups are supported: arithmetic and logic operations; register data manipulation; load/store (to internal memory); program flow; task yielding; and agent instructions (DMA, communication peripherals, CRC, CAM, etc.).

[0594] Instruction Pipeline for a Flexible Packet Processor

[0595] Referring now to FIG. 44, an exemplary network processor pipeline 460 is illustrated. According to one embodiment of the invention, the network processor pipeline 460 consists of five stages: fetch, decode, address, execute and write. The network processor pipeline 460 enables a standard design flow and standard memories. The network processor can perform an instruction together with a data load or store from/to a unified internal memory in each cycle. The network processor pipeline 460 enables an arithmetic instruction to use as its source operands data that was loaded by the previous instruction without any bubble. Conditional jump and branch instructions have no penalty when the condition is not taken, while a penalty of 2 cycles occurs if the condition is taken and there is a change of flow. To reduce this penalty, delayed jump and branch instructions are provided. In addition to the data ALU there is an address ALU to enable efficient pointer calculation on data access. The network processor general purpose registers (r0-r31) are updated during the write stage without distinction as to whether they are updated from a load operation or from a data ALU operation.

[0596] Pipeline Stages

[0597] There are five pipeline stages: Fetch; Decode; Address; Execute; and Write.

[0598] The Fetch Stage

[0599] During the fetch stage, the network processor core places the next instruction fetch address. This next fetch address can originate from the Program Counter (PC) in the normal sequential flow or can come from the address ALU when there is a jump or branch instruction. A newly fetched 32-bit instruction is assumed to be ready during the next clock cycle after a specific access time from the specific internal memory. Since the network processor internal SRAM is unified for both data and programs, and since it should support 64-bit access for data, the network processor initiates a fetch of 2 instructions (64 bits). The Fetch Unit (FTU) contains a fetch buffer to hold fetched instructions that have not yet been processed.

[0600] The Decode Stage

[0601] At the decode stage, the new instruction fetch is complete and the decoding of the new instruction is performed. The decode logic determines the type of the incoming instruction and the operations that should be performed at each pipeline stage for the execution of the instruction.

[0602] The Address Stage

[0603] During the address stage the data address for a load from memory or for a store to memory is calculated by the address ALU. The address ALU gets its source operands, which can originate from one or two of the GPR registers, an immediate address offset or an absolute address. In jump or branch instructions, the destination address is also calculated by the address ALU. One of the address ALU inputs is the PC itself for branch address calculation. After address calculation is performed, the core places the new data address on the Data Address Bus (DAB) or the new program address (for change of flow) on the Fetch Address Bus (FAB). If the instruction is a store, data to be stored into memory is placed on the Store Data Bus (SDB) during this stage.

[0604] The Execute Stage

[0605] The data ALU execution is done at the execute stage. Source operands are read from the register file to the Data ALU, and data arithmetic is performed. For example, if the instruction is an ADD of r1 with r2, then r1 and r2 are mux-ed into the data ALU and arithmetic addition is performed during the execute stage. Condition Codes (CC) are also calculated at this stage. By the end of the execute stage, the data arithmetic execution result together with the CC are ready.

[0606] The Write Stage

[0607] At the write cycle, the register file is updated. The update can come from various sources: a destination of an arithmetic result, loaded data from memory, a move from a Special Purpose Register (SPR), or a move of an immediate value into the register file. In case of a jump or branch to a subroutine, the PC is also latched into one of the two LINK registers inside of the register file. The CC register is also updated at this stage.
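To visualize how instructions overlap in the five stages described above, the following Python sketch computes which stage an instruction occupies in a given cycle and gives a rough cycle estimate. It is an idealized model, not a description of the actual core: it assumes one instruction enters the fetch stage per cycle with no stalls, and that each taken change of flow simply adds the 2-cycle penalty mentioned in paragraph [0595]. The function names are hypothetical.

```python
STAGES = ["fetch", "decode", "address", "execute", "write"]


def stage_at(instr_index, cycle):
    """Stage occupied by instruction `instr_index` during `cycle`.

    Instruction i is assumed to enter the fetch stage at cycle i.
    Returns None if the instruction is not in the pipeline then.
    """
    position = cycle - instr_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None


def total_cycles(num_instructions, taken_changes_of_flow=0, penalty=2):
    """Rough cycle count for a straight run of instructions."""
    if num_instructions == 0:
        return 0
    fill_and_drain = len(STAGES) + (num_instructions - 1)
    return fill_and_drain + penalty * taken_changes_of_flow
```

In this model, while instruction 0 is in the write stage (cycle 4), instruction 1 is in execute and instruction 4 is being fetched.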

[0608] Restricted Sequences

[0609] The network processor pipeline is designed to enable a standard design flow with standard memory interfaces. It is a five stage pipeline which is optimized for sequences that are frequently used and sequences that have a large effect on performance. By optimizing some of the sequences, there may be other sequences that might be problematic. These may be solved by inserting software restrictions. Table 5 below lists some of the sequence restrictions according to one embodiment of the invention.

TABLE 5
No. | Sequence | Restriction Description
1 | Register update followed by a store | Any instruction which updates an r register (for example: move instructions, ALU instructions, load instructions, etc.) may not be followed immediately by a store instruction of that same r register. This includes instructions that update CC flags in r1 followed by a store of r1.
2 | Register update followed by a use of this register as a memory pointer | Any instruction which updates an r register (for example: move instructions, ALU instructions, load instructions, etc.) may not be followed immediately by an instruction which uses that same r register as a memory pointer or as a source for a memory pointer calculation. Instructions that might use an r register as a pointer include: load, store, jump, branch, yield, and case. This includes instructions that update CC flags in r1 followed by an instruction that uses r1 as a memory pointer.
3 | Register update followed by a use of this register by AGENT WRITE instructions or by DMA instructions | Any instruction which updates an r register (for example: move instructions, ALU instructions, load instructions, etc.) may not be followed immediately by AGENT WRITE instructions or DMA instructions which use that same r register.
4 | Instructions inside a delay slot | Change of flow instructions are not allowed in any kind of a delay slot. Change of flow instructions include: Jump or Branch instructions; Yield instructions; Case instruction; RFT instruction; DMA instructions with the yield option set.
5 | Instruction inside the delay slot of a “yield” | The only instructions that are allowed in a delay slot of a yield instruction are: Store instructions; Agent Write instructions; DMA instructions (only when the yield option is not set).
6 | Change of True sticky bit before a conditional store or conditional agent write or agent read instruction | Any instruction which updates the conditional sticky bit may not be followed immediately by a: conditional store instruction; conditional agent write instruction; conditional agent read instruction.
7 | SPRS to nrefetch SPR followed by an RFT instruction | SPRS instruction with nrefetch SPR as its destination may not be followed immediately by an RFT instruction.
8 | r31 register update followed by a conditional change of flow with one of r31 bits as a condition | Any instruction which updates the r31 register may not be followed immediately by a conditional change of flow instruction which uses one of r31 bits as a condition.
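Restrictions of this kind lend themselves to an automated check in an assembler or lint tool. As a hedged illustration, the Python sketch below checks only restrictions 1 and 2 of Table 5 on a simplified instruction representation. The `Instr` record and its fields are hypothetical and do not correspond to any actual tool or encoding described in this document.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    """Simplified view of one instruction (hypothetical representation)."""
    name: str
    writes: set = field(default_factory=set)    # r registers updated
    stores: set = field(default_factory=set)    # r registers stored to memory
    pointers: set = field(default_factory=set)  # r registers used as pointers


def find_violations(program):
    """Return (index, restriction_number) pairs for adjacent-pair violations.

    Restriction 1: register update immediately followed by a store of it.
    Restriction 2: register update immediately followed by its use as a
    memory pointer.
    """
    violations = []
    for i in range(len(program) - 1):
        prev, nxt = program[i], program[i + 1]
        if prev.writes & nxt.stores:
            violations.append((i + 1, 1))
        if prev.writes & nxt.pointers:
            violations.append((i + 1, 2))
    return violations
```

Placing any unrelated instruction between the register update and its dependent use removes the violation in this model, which mirrors the "followed immediately" wording of the table.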

[0610] Pipeline Timing Diagram

[0611] The pipeline timing and stages 480 are illustrated with reference to FIG. 45. This diagram 480 together with the pipeline block diagram 460 from FIG. 44 illustrates the basic flow through the pipeline stages inside the network processor core. FIG. 45 starts with the update of the Program Counter (PC) with the address of the next instruction. The Fetch Address Bus (FAB) gets its content from the PC and starts a memory fetch access. A new instruction is available on the Fetch Data Bus (FDB) during the decode cycle and is passed directly to the decode logic. The address ALU operates during the address stage and sends a new data address to the data memory. If the operation is a load then the loaded data is available on the Load Data Bus (LDB) during the execute stage. If the operation is a store then the stored data is placed on the Store Data Bus (SDB) during the address stage. The Data ALU gets its source operands and executes the data arithmetic at the execute stage. By the end of the execute stage, the data arithmetic result and the Condition Codes (CC) are ready to be latched into the destination register on the next clock edge of the write cycle. If it is a load instruction then the loaded data is also latched into the destination register on the positive clock edge of the write cycle. All register update operations go through the rf_in_mux and the actual update is on the write cycle. An update to any one of the Special Purpose Registers (SPRs) is also done at the write stage.

[0612] An Internal Memory to be Used with the Flexible Packet Processor

[0613] Referring now to FIG. 46, an exemplary internal memory 500 for implementation in the network processor (NP) is illustrated. According to one aspect of the invention, the Vobla (network processor [NP]) Memory (VMEM) 500 is a small and fast memory located near the network processor NP core. The VMEM 500 serves the NP with three separate ports and the rest of the system with two ports. The main features of the VMEM according to one embodiment of the invention include: operates with the NP clock; supports multiple ports (e.g., five ports); maximum bandwidth of, for example, about 8 Gbytes/second (5 accesses×200 MHz×8 bytes); 64 Kbytes of SRAM, with a first area from 0 to 48 KB and a second area from 64 to 80 KB.

[0614] SRAM Mapping and Priority

[0615] The SRAM, in one embodiment, is divided into three sub areas: 0 to 8 K - data and tasks context; 8 to 48 K - data and program; and 64 to 80 K - program. The above 64 KB memory space can be accessed by the ring for writes and by the multireader for reads. According to one embodiment, the priority in each one of the memory areas is according to the following rule: (1) ring interface - highest priority; (2) program; (3) data (load/store); (4) context; and (5) multireader - lowest priority.
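The fixed priority rule stated above can be illustrated with a small arbiter model. The Python sketch below is an example only: the port name strings and the function `arbitrate` are hypothetical, and the model assumes that all requests passed in target the same memory area in the same cycle, so that exactly one is served and the rest are stalled.

```python
# Ports in descending priority, following the rule in the text.
PRIORITY = ["ring", "program", "data", "context", "multireader"]


def arbitrate(requests):
    """Pick the winning port among simultaneous requests.

    Returns (winner, stalled), where stalled lists the losing ports in
    priority order. Returns (None, []) when nothing is requested.
    """
    requested = set(requests)
    unknown = requested - set(PRIORITY)
    if unknown:
        raise ValueError(f"unknown ports: {sorted(unknown)}")
    ordered = [port for port in PRIORITY if port in requested]
    if not ordered:
        return None, []
    return ordered[0], ordered[1:]
```

For example, if the program, data and multireader ports collide, the program fetch is served and the other two are stalled until a later cycle.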

[0616] Interfaces of the VMEM

[0617] The VMEM supports the NP by three ports: data (load/store), program, and context. The VMEM supports the ring interface and the NP compound by two ports: multireader and ring writer.

[0618] Network Processor Program Bus (v_program)

[0619] This is a read port from the NP. Each access on this bus is for an aligned double word (64 bits). The port has a 15-bit address bus, A(17:3), which allows access to 32K double words or 256 Kbytes (A(2:0) are don't care bits in this case), and a 64-bit data out bus.

[0620] Network Processor Data Port (v_data)

[0621] This is a read and write port from the NP. The data size can be a byte (8 bits), half-word (16 bits), word (32 bits), or double word (64 bits). The access has to be aligned to the data size (half word on the boundary of half word, etc.). All the accesses are right aligned: byte in bits 0 to 7, half-word in bits 0 to 15, and word in bits 0 to 31. A special data aligner for this port will arrange the incoming and outgoing data according to the address and size of the transaction. The interface will generate the byte enable signals to the VMEM according to address bits A(2:0) and the size of the transaction, where: a 16-bit address bus, A(15:0), allows access to the first 64 Kbytes of the VMEM address space; A(2:0) and the data size control the byte enable signals; 48 Kbytes of SRAM are present in the current implementation; a 64-bit data out bus serves read accesses; and a 64-bit data in bus serves write accesses.
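The derivation of byte enable signals from A(2:0) and the transaction size can be sketched as follows. This Python fragment is illustrative: the function name is hypothetical, and the convention that bit i of the returned mask enables the byte at offset i within the 64-bit entry is an assumption, since the text does not define the lane numbering.

```python
def byte_enables(address, size):
    """Return an 8-bit byte enable mask for a v_data access.

    Bit i of the result enables the byte at offset i inside the
    aligned 64-bit entry (assumed lane numbering).
    """
    if size not in (1, 2, 4, 8):
        raise ValueError("size must be 1, 2, 4 or 8 bytes")
    if address % size != 0:
        raise ValueError("access must be aligned to its size")
    offset = address & 0x7  # address bits A(2:0)
    return ((1 << size) - 1) << offset
```

A half-word access at an address whose A(2:0) equals 6 would enable the two highest byte lanes under this convention, and a double word access enables all eight.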

[0622] Network Processor Context Port (v_context)

[0623] This is a read and write port from the NP. The data size is a word (32 bits) for write access and a double word (64 bits) for read access. The interface will generate the byte enable signals to the VMEM according to address bit A(2). No data aligner is needed for this interface, where: an 11-bit address bus, A(12:2), allows access to the first 2K words (8 Kbytes) of the memory space (A(1:0) and A(15:13) are don't care bits in this case); a 64-bit data out bus serves read accesses; and a 32-bit data in bus serves write accesses.

[0624] Multireader Port (v_mrd)

[0625] This is a read port from the multireader. The data size is a double word (64 bits).

[0626] 15 bits Address bus - A(17:3). Allows access to all the VMEM address space.

[0627] A(2:0) - don't care.

[0628] 64 bits data out bus for read access.

[0629] Ring Interface Write Port (rif_i)

[0630] This is a write port from the ring interface. The data size can be from 1 to 8 bytes and the data should be in one aligned double word so only one access to the memory is needed. The data is left aligned (big endian) and a special data aligner for this port will arrange the incoming data according to the VMEM address. The interface will generate the byte enable signals to the VMEM according to address bits A(2:0) and the size of the transaction, where: an 18-bit address bus, A(17:0), allows access to all the VMEM address space; and a 64-bit data in bus serves write accesses.

[0631] VMEM Micro Architecture

[0632] Basic SRAM Module

[0633] According to one approach, the VMEM uses two kinds of SRAM modules: a single port SRAM organized as 512 words of 64 bits (4 KB) and a single port SRAM organized as 2048 words of 64 bits (16 KB). Each SRAM gets 8 byte enable (BE) control signals.

[0634] SRAM Memory Array

[0635] The SRAM array is divided into 13 SRAM modules and the overall size is 64 Kbytes. The first group is between 0 and 48 Kbytes. In terms of address space, each pair of SRAMs occupies 8 Kbytes. The odd SRAM contains the first, third, etc. 8 bytes (0-7, 16-23, etc.), while the even SRAM contains the second, fourth, etc. 8 bytes (8-15, 24-31, etc.). The second group is between 64 and 80 Kbytes. This group includes a single 16 Kbyte SRAM.
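The interleaving described above can be illustrated with a short sketch. The function name and the returned tuple layout are hypothetical; only the address ranges and the odd/even interleave come from the text.

```python
KB = 1024


def sram_for_address(addr: int):
    """Map a VMEM byte address to the SRAM module that holds it.

    Returns ("pair", pair_index, "odd" | "even") for the first group,
    ("single", 0, None) for the 16 Kbyte SRAM, or None if the address
    is unimplemented. The tuple layout is an illustrative assumption.
    """
    if 0 <= addr < 48 * KB:
        pair_index = addr // (8 * KB)        # each SRAM pair covers 8 Kbytes
        # Consecutive 8-byte words alternate between the two SRAMs of a pair:
        # the "odd" SRAM holds bytes 0-7, 16-23, ...; the "even" SRAM holds
        # bytes 8-15, 24-31, ...
        half = "odd" if ((addr >> 3) & 1) == 0 else "even"
        return ("pair", pair_index, half)
    if 64 * KB <= addr < 80 * KB:
        return ("single", 0, None)
    return None


if __name__ == "__main__":
    for a in (0, 8, 16, 8 * KB, 48 * KB, 64 * KB):
        print(hex(a), sram_for_address(a))
```

Interleaving on 8-byte words means two ports that walk through consecutive double words tend to hit different SRAM macros, which reduces contention.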

[0636] VMEM Control

[0637] The control is responsible for supplying the SRAM macros with addresses and data, and for routing the data from the SRAMs to the right bus. A contention occurs when there are two or more accesses to the same SRAM macro. In that case, a priority mechanism is needed to avoid starvation. The VMEM sends a stall signal, and the delayed transaction is held by the VMEM until it receives service. The write access from the ring interface port has the highest priority.

[0638] Restrictions. Any access to unimplemented memory returns garbage information without any special notification to the system. Any access that crosses the eight-byte boundary of the SRAM macro (e.g., a transaction to address 12 with a size of 8) is invalid; the result is unpredictable and no error notification is given.
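Since the hardware gives no error notification, software may wish to check the boundary restriction itself. A minimal sketch, with a hypothetical function name:

```python
def crosses_double_word(address: int, size: int) -> bool:
    """True if an access of `size` bytes at `address` spans two 8-byte SRAM words."""
    return (address & 0x7) + size > 8


if __name__ == "__main__":
    # The example from the text: address 12, size 8 covers bytes 12..19,
    # which straddles the boundary at byte 16.
    print(crosses_double_word(12, 8))
```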

[0639] Data In Path

[0640] Data In aligners. There are two data aligners in the Data In Path. The first is the data aligner for the NP data bus. The input to this data aligner is right aligned with a size of 1, 2, 4, or 8 bytes.

[0641] The second is the data aligner for the ring write bus. The input to this data aligner is left aligned (big endian) with a length of 1 to 8 bytes, which is part of one double word (64 bits) entry in the SRAM.

[0642] Data In buffers. There are two 64-bit data buffers for storing the incoming data from the NP data bus and NP context bus in case of a contention in the VMEM. Since the ring write bus has the highest priority, it does not need a buffer.

[0643] Address in Path

[0644] Address In buffers. There are four 16-bit address buffers for storing the incoming address from the NP data address bus, NP context address bus, NP program address bus, and the multireader address bus in case of contention in the VMEM. Since the ring interface has the highest priority, it does not need a buffer.

[0645] Address In Muxes. There is a 4 to 1 mux (multiplexer) for each of the SRAM macros. The first two ports of all muxes are connected to the ring write address and the multireader address.

[0646] There are two options for the third port: the NP context address port, which connects to the two muxes that support the two SRAM macros occupying addresses 0 to 8 Kbytes; and the NP program address bus, which connects to the ten muxes that support the ten SRAM macros in addresses 8K to 48 Kbytes. The NP data address bus is connected to the 12 address in muxes (the last SRAM is not connected to the data bus).

[0647] Data Out Path

[0648] Data Out Muxes. There are four data out muxes of 64 bits: a 13 to 1 mux for the multireader data out bus, connected to the 13 SRAM macros that reside in addresses 0 to 48 Kbytes and 64 to 80 Kbytes; a 12 to 1 mux for the NP data out bus, connected to the 12 SRAM macros that reside in addresses 0 to 48 Kbytes; an 11 to 1 mux for the NP program data out bus, connected to the 10 SRAM macros that reside in addresses 8K to 48 Kbytes and to the one SRAM macro that resides in addresses 64K to 80 Kbytes; and a 2 to 1 mux for the NP context data out bus, connected to the 2 SRAM macros that reside in addresses 0 to 8 Kbytes.

[0649] Data Out aligner. There is a data aligner for the NP data out bus. The output of this aligner is right aligned according to the access size (1, 2, 4, or 8 bytes) and the access address.

[0650] The Core of the Flexible Packet Processor and Associated Compounds (Agents and Non-Agents)

[0651] A block diagram of the network processor core according to one embodiment of the invention was provided in FIG. 37. The network processor compounds are those modules of the ring network implemented by the network processor that are tightly connected to the network processor core. Network processor compounds share a single ring interface and address space with the network processor core. In other words, according to one embodiment of the invention incorporating the network processor into a SOC using rings-type architecture, the network processor core and the network processor compounds are all elements of a single ring member.

[0652] Network processor compounds include agents and non-agents. Agents are programmed by network processor commands through the network processor agent interface, discussed below. Non-agents are programmed by internal agents or through the ring interface by external members.

[0653] FIG. 47 is a schematic diagram of the network processor 500 according to an embodiment of the invention. FIG. 47 illustrates the ring interface 512 (dotted box at the bottom) and the network processor, which includes the network processor core 514 and the various compounds. The compounds include agents such as the doorbell agent 516, CRC/snoop agent 520, multireader agent 524, timer agent 526, message_sender agent 528, and DMA agent 530.

[0654] Multireader Agent 524

[0655] The multireader module is an engine that serves requests to read portions of data from the network processor memory and sends the data read back to the destination. In one embodiment of the network processor, the destination is most likely to be located external to the network processor compound (the only internal modules that might use this data are the CRC snooper, or the memory in a mode in which portions of the memory are copied from one location to another). The multireader is connected to the ring write interface and to the agent interface, from which it can get requests to read data from the memory.

[0656] Operation

[0657] The multireader agent and the network processor memory share the same address space. Hence the multireader responds only to messages of work read type. The memory will respond only to messages of work write type.

[0658] According to one aspect of the invention, the multireader can get requests for data from the following modules: 1) local network processor (via the agent interface); 2) the three DMA controllers; 3) remote (external to the compound) network processor; and 4) the host (PP).

[0659] All the external requests for memory reads are stored in the request FIFO. The local network processor requests are stored in a special request entry. There are two reasons why two different queues are used for the requests. The first reason is to have the ability to stall the local network processor if it asks for a new multiread request before the previous one was served. The second reason is to have the ability to know when the local network processor multiread was finished. These features cannot be implemented by hardware for the other request sources, since the other requests are generated by members connected to the ring.

[0660] The network processor request entry is written from the agent interface and the request FIFO is written from the ring. All the requests are stored in the request entry or FIFO until they are serviced.

[0661] The order of serving the multiread requests is as follows: If the network processor entry has a valid multiread request, it will be served before any other request in the request FIFO. If the network processor request entry is empty, other requests will be served on a first-in-first-out basis.

[0662] The multireader, in one embodiment, has the ability to stall data sent to the ring. A stall of data delivery could occur if the output FIFO of the ring is full, or if there is a higher priority message that should be sent to the ring (for example, DMA or message sender messages).

[0663] The multireader request FIFO preferably is 8 entries deep (which should be sufficient to avoid the overrun case). FIG. 48 is a schematic diagram of the multireader agent 524 according to an embodiment of the invention.
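The request handling described in this section (a single entry for the local network processor, an 8-deep FIFO for ring requests, and local requests served first) can be modeled in a few lines. The class and method names are hypothetical, and returning False to signal a stall or an overrun is a modeling choice rather than a description of the hardware signals.

```python
from collections import deque


class MultireaderQueue:
    """Behavioral model of the multireader request entry and request FIFO."""

    FIFO_DEPTH = 8  # preferred depth given in the text

    def __init__(self):
        self.np_entry = None      # single entry for the local network processor
        self.fifo = deque()       # requests arriving from the ring

    def post_np(self, request) -> bool:
        """Register a local NP request; False means the NP would be stalled."""
        if self.np_entry is not None:
            return False
        self.np_entry = request
        return True

    def post_ring(self, request) -> bool:
        """Register a ring request; False models the overrun case."""
        if len(self.fifo) >= self.FIFO_DEPTH:
            return False
        self.fifo.append(request)
        return True

    def next_request(self):
        """Pick the next request: the local NP entry first, then FIFO order."""
        if self.np_entry is not None:
            request, self.np_entry = self.np_entry, None
            return request
        if self.fifo:
            return self.fifo.popleft()
        return None


if __name__ == "__main__":
    q = MultireaderQueue()
    q.post_ring("dma0")
    q.post_np("local")
    print(q.next_request(), q.next_request(), q.next_request())
```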

[0664] Data Packing and Alignment

[0665] The network processor memory, in one embodiment, uses a 64-bit data port. The multireader takes advantage of this fact, so every memory read is of eight bytes. The system must also allow byte-size data transfers over the ring from any memory location to any destination address.

[0666] The data that is read from the memory and sent on the ring is aligned to the left (MSB [most significant bit] of the message) because big endian byte orientation is used. Because of these requirements, an aligner is needed in the multireader.

[0667] Another goal is to minimize data transfers over the ring and enable straightforward writing to FIFOs. This goal is satisfied using data packing logic, which means that all the transferred messages except the last one will contain 8 valid bytes. The last message might contain fewer than 8 bytes, in which case the message type will indicate how many valid bytes there are.

[0668] The alignment and packing is done in the following manner. FIG. 49 describes the data alignment 550 in case the last message contains 8, 7, ..., 1 valid bytes, when reading from an aligned address. It should be noted that when data is written to memory, the opposite alignment should be performed. For example, consider the following scenario: reading 10 bytes starting at address=5. The multireader will send the following data in the messages (X in the data part of the message means that this byte is don't care).
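A minimal sketch of the packing rule stated above (every message except the last carries 8 valid bytes, left aligned) applied to the 10-byte read from address 5. The function name is hypothetical, and zero bytes are used here to stand in for the don't-care (X) positions; the hardware may place any value there.

```python
def pack_messages(memory: bytes, start: int, count: int):
    """Return a list of (payload, valid_bytes) tuples, one per ring message.

    Each payload is 8 bytes, left aligned (first valid byte in the most
    significant position). Padding bytes are zero in this model only.
    """
    data = memory[start:start + count]
    messages = []
    for i in range(0, len(data), 8):
        chunk = data[i:i + 8]
        payload = chunk + bytes(8 - len(chunk))   # pad the last message
        messages.append((payload, len(chunk)))
    return messages


if __name__ == "__main__":
    memory = bytes(range(32))          # memory[i] == i, for readability
    for payload, valid in pack_messages(memory, 5, 10):
        print(payload.hex(" "), "valid bytes:", valid)
```

With this rule, the read of 10 bytes from address 5 produces two messages: one carrying bytes 5 through 12 and one carrying bytes 13 and 14 followed by six don't-care bytes.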

[0669] Multireader—Memory Interface

[0670] The multireader starts to issue memory read cycles if there is at least one multiread request pending in the multireader request FIFO or request entry. Every read cycle that the multireader issues to the memory is an 8-byte request (in order to reduce the number of requests). The memory read cycle starts when the multireader generates the address and read strobe for the memory. The memory detects this request and, if not busy with other requests, drives the data to the multireader on the following cycle. If the memory is busy and cannot drive the data to the multireader, it stalls the multireader. The multireader waits for the data from the memory as long as the stall signal is asserted.

[0671] It is desirable that the originator of a multiread request be able to know that the multiread operation is complete. If the originator of the multiread request is the local network processor, it will have the ability to know whether the multiread operation has finished. The multireader sends the network processor a busy signal indicating that the multireader has not finished the multiread transfer of the local network processor. The multireader busy indication is asserted when the multiread request is registered in the network processor entry and negated after the last message containing data of this request is sent to the ring.

[0672] For other originators of multiread requests (like the remote network processor or PP), the indication of multiread transfer end is controlled by software. The software control is achieved by preparing a special data word at the end of the transferred block. The destination of the multiread operation snoops this data. When this data is detected, the multiread operation is finished. Note that only one transfer can be active during the time of the snoop (otherwise it will not be possible to detect which operation is finished).

[0673] Sending a message with first/last data in frame indication.

[0674] The multireader looks in the type field of the incoming message (multiread request) or in the option bits of the network processor multiread request, and, if the bit F is set, the first message in the multiread process will be sent with a destination address which indicates the first byte in the frame.

[0675] The multireader also looks in the type field of the incoming message or in the option bits of the network processor multiread request, and, if the bit L is set, the last message in the multiread process will be sent with a destination address which indicates the last byte in the frame. (Every FIFO in the system should have three addresses which, when written to, indicate first or last data in the frame.) The multireader will modify bits 2 and 3 of the destination address according to the F, S bits.

[0676] Calculating CRC of Message Data. In case there is a need to calculate the CRC of the message data, the multiread request must set the S option bit. This bit will cause the multireader to send all the messages with a type in which the S (snoop) bit is set. The CRC machine will snoop those messages and calculate the data CRC. Since the CRC machine is a 32-bit machine and the message data is 64 bits wide, the CRC machine should have the ability to stall the multireader from sending data to the ring until the CRC calculation on the data has been completed.

[0677] Multireader Input and Output Message Formats

[0678] A general multireader message will have the following format, as set forth in Table 6, for multireader input and output message format.

TABLE 6
Field            Description
type[7:0]        The type field describes the incoming message type. The following types are valid:
                 type[7:0] = 00000XXX: idle
                 type[7:0] = 010XXLFI: work read
address[23:0]    The field describes the starting address for reading data from Vobla memory.
data[31:0]       This field contains information required for generating the output message and the operation of the multireader:
                 data[23:0] = destination address of the data
                 data[31:24] = the number of bytes to read from the Vobla memory (if data[31:24] is zero, the multireader reads 256 bytes)
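As an illustration of the data field layout in Table 6, the following sketch splits a 32-bit request word into the destination address and byte count, applying the rule that a count of zero means 256 bytes. The function name is hypothetical.

```python
def decode_multiread_data(data: int):
    """Split the data[31:0] field of a work read request.

    Returns (destination_address, byte_count), where a count field of
    zero is interpreted as 256 bytes.
    """
    destination = data & 0xFFFFFF        # data[23:0]
    count = (data >> 24) & 0xFF          # data[31:24]
    return destination, (count if count != 0 else 256)


if __name__ == "__main__":
    dest, count = decode_multiread_data(0x0A123456)
    print(f"destination={dest:#08x} count={count}")
```

Treating zero as 256 lets an 8-bit field express every transfer length from 1 to 256 bytes.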

[0679] Table 7 illustrates the multireader output message format for the multireader sending data to the rings. It should be noted that the multireader input message type is always a read type, and the output message is always a work_write type.

TABLE 7
Field            Description
type[7:0]        The type field describes the outgoing message type. The following types are valid:
                 type[7:0] = 00000XXX: idle
                 type[7:0] = 100FLZZZ: work write
address[23:0]    The address of the destination. This information is based on what was extracted from the input message data field, and the option bits of the message type (L/F/I).
data[63:0]       This field contains the data that was read by the multireader.

[0680] Network Processor Multiread Request Format

[0681] When the network processor initiates a multiread request, it has to write to the network processor entry in the multireader. FIG. 50 describes how the multireader maps the data on the agent bus 556 to the multireader operation 558. The options are:

[0682] L: indication of last multireader request in frame (L=1: last).

[0683] F: indication of first multireader request in frame (F=1: first).

[0684] S: snoop indication for the CRC snooper (S=1: snoop this message).

[0685] I: increment destination address after every multiread transfer.

[0686] If the network processor sends new multiread requests while the multireader is busy serving previous requests, those requests will stall the network processor. (Note: If the count value is zero, the multireader reads 256 bytes from the memory.)

[0687] Requests Serving Priority

[0688] According to one approach, if more than one multiread request is pending, the priority of serving them will be: (1) serving local network processor requests if there are pending requests; and (2) serving all other requests on a FIFO basis.

[0689] Multireader Operation Scenarios—Examples.

[0690] Example A: Sending data to serial transmit FIFO. (1) The serial sends a request to fill its transmit FIFO.

[0691] (2) The request is registered in the doorbell logic. When this request is serviced, the network processor sends an agent write command to the multireader asking for data transfer.

[0692] (3) The multireader decodes the message (or the agent command) and initializes its operation.

[0693] (4) The multireader initiates memory read cycles and data from the memory is sent to the multireader.

[0694] (5) The multireader packs the data, generates the output message, and sends it to the ring if the ring is vacant. The destination is the transmit FIFO in the peripheral.

[0695] (6) The process of reading data and sending it to the destination repeats itself until the entire data transfer is complete.

[0696] Example B: Sending data to DMA write (transmit) buffer. (1) The DMA controller issues a multireader message. This message asks for data transfer from the memory to the DMA controller write buffer (the message will contain the destination address, the number of bytes that are required, and the starting location in the network processor memory).

[0697] (2) The multireader decodes the message and initializes its operation.

[0698] (3) The multireader initiates memory read cycles and data from the memory is sent to the multireader.

[0699] (4) The multireader packs the data, generates the output message, and sends it to the ring if the ring is vacant. The destination is the write buffer in the DMA controller.

[0700] (5) The process repeats itself until the entire data transfer is completed.

[0701] Software/Hardware Restrictions

[0702] According to one embodiment of the invention, the following restrictions may apply: do not activate more than one multireader at a time from each source (except the DMA, which can send two) in order not to cause overflow in the FIFO; and if the destination of the multiread request is one of the NP memories, only aligned transactions are supported because the memory does not support overflow of memory entry during a write (split one write command to two).

[0703] Message Sender Agent 528

[0704] The message sender agent 528 is a module which translates a network processor AGENT command to a message to be sent to a destination on the ring. The message sender is connected to the network processor agent interface. The message sender is a powerful module since it can generate messages in all the different message types that are available in the system. This means that the network processor can send messages to all the modules that are connected to the ring, and even replace the host in sending supervisor messages. This feature can be very beneficial while debugging the system. The block diagram of the message sender 528 is shown as FIG. 51.

[0705] There are three instructions dedicated for agent commands: AGENTW, AGENTWI, and AGENTR. The message sender ignores the AGENTR command. The AGENTW/I commands drive the value of three registers, or two registers and an immediate value, on the agent bus. Those registers are marked RA, RAP, and RB (or imm8). The message sender will interpret the content of those registers in the following way (shown in FIG. 52):

[0706] Mapping for the AGENTW command is as follows:

[0707] RAP[23:0]: The destination address or the 32 LS (least significant) bits of the data. This is a 24-bit address of a module (destination) that is connected to the ring, or the 4 LS bytes of the data that is sent to the ring when using the 64-bit data mode.

[0708] RA[31:0]: The data that will be sent to the destination (typically in work read messages it will include the return address for the data that was read from the module and the number of bytes to read).

[0709] RB[7:0]: The message type that will be sent to the destination (only the least significant byte of RB will be used). In a 64-bit data message RB is the address of the message destination.

[0710] The AGENTWI command drives the value of two registers and an eight-bit immediate value (imm8) on the agent bus. The registers are marked RA and RAP. The message sender will use the content of those registers in the following way:

[0711] RAP[23:0], RA[31:0]—same as AGENTW command.

[0712] imm8—the message type that will be sent to the destination.

[0713] Note: If the AGENTWI command is used there is no possibility to send a 64-bit data message. Both commands also drive option bits, which are part of the AGENT opcode. Each module uses those bits in a different way. The message sender will use 7 option bits. FIG. 52 illustrates a mapping of an agent write command 560 to a message 562.

[0714] If the network processor sends new requests for message sending while the message sender is busy serving previous requests, those requests will stall the network processor. The message sender will have an internal queue of 2 entries so it can store 2 requests for sending messages before stalling the network processor.

[0715] Message Sender Output Message Types

[0716] Table 8 illustrates the message sender output message format according to an embodiment of the invention.

TABLE 8
Field                Description
type[7:0]            The type field describes the outgoing message type. The following types are valid (see message type table for more details):
                     type[7:0] = 00000XXX: idle
                     type[7:0] = 11111NNN: supervisor
                     type[7:0] = 010XXLFI: work read
                     type[7:0] = 100FLZZZ: work write
address[23:0]        The address of the destination. This is the content of RAP or RB according to the mode used (option bit 6). If option[6] is one, the address is taken from RB.
data[63:0]/[31:0]    The message data. The content of RA, or RA and RAP, according to the mode used (option bit 6). If option[6] is one, RA and RAP are used.

[0717] Data alignment. The alignment of the message data is determined according to the message size and type. The following Table 9 describes message data alignment.

TABLE 9
Output message type    Operation mode (64/32)    Data size (in bytes)    Output message format
work write             64                        8                       {RA[31:0], RAP[31:0]}
work write             32                        1                       {RA[31:0], 32'b0}
work write             32                        2                       {RA[31:0], 32'b0}
work write             32                        3                       {RA[31:0], 32'b0}
work write             32                        4                       {RA[31:0], 32'b0}
work write             32                        8                       {32'b0, RA[31:0]}
work read              don't care                don't care              {32'b0, RA[31:0]}
supervisor             don't care                don't care              {32'b0, RA[31:0]}

[0718] Sending a 64-bit Data Message

[0719] The message sender can send a 64-bit data message. Sending a 64-bit message is done by setting option bit[6] of the AGENTW command to one (this option is not available for the AGENTWI command). If this option is used, the message sender uses the content of RA and RAP as the source for the raw data, and RB as the source for the raw address. In this mode the message type is always work write, with 8 valid data bytes. There is no provision for sending fewer than 8 bytes.

[0720] Handling Data and Address Options

[0721] The message sender uses six option bits that are driven by the network processor in order to modify the value of the raw_data and raw_address. This feature is useful when the values in the registers are used as constants and are modified as required. For example, when writing to a FIFO the content of RAP will be the FIFO address, and when the system seeks to write the first in frame or last in frame locations the address will be modified using the option bits. Data modification is useful when sending a doorbell request. The data for the doorbell request is only 3 bits. Hence the raw data can be modified to generate data for the doorbell request. The address and data modification may be performed as follows: (1) the content of RAP[4:2] or RB (in 64-bit data mode) is OR'd with the option[2:0] bits to generate the message destination address; and (2) if the value of option bits[5:3] is not zero, the content of RA[2:0] or RAP (in 64-bit data mode) is replaced with the option[5:3] bits to generate the message data. Address and data modification are active regardless of the message sender operation mode.
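The address and data modification rules above can be sketched for the 32-bit data mode as follows. The function name is hypothetical, and the 64-bit mode (where RB and RAP take the place of RAP and RA) is deliberately left out because the text does not spell out its bit positions.

```python
def apply_options_32(rap: int, ra: int, options: int):
    """Apply message sender option bits in 32-bit data mode.

    Rule (1): option[2:0] is OR'd into RAP[4:2] to form the destination address.
    Rule (2): if option[5:3] is nonzero, it replaces RA[2:0] in the message data.
    Returns (destination_address, message_data).
    """
    addr_opt = options & 0x7
    data_opt = (options >> 3) & 0x7

    address = (rap & 0xFFFFFF) | (addr_opt << 2)
    data = ra & 0xFFFFFFFF
    if data_opt != 0:
        data = (data & ~0x7) | data_opt
    return address, data


if __name__ == "__main__":
    # RAP holds a FIFO base address; option[0] selects one of its frame-marker addresses.
    addr, data = apply_options_32(0x001000, 0xABCD0000, 0b000001)
    print(f"address={addr:#08x} data={data:#010x}")
```

Because rule (1) is an OR, software would keep RAP[4:2] at zero in the base address so that the option bits alone choose the variant.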

[0722] Software/Hardware Restrictions

[0723] Software/hardware restrictions include the following in one embodiment of the invention: (1) the 64-bit data mode is available only when using the AGENTW command; and (2) in 64-bit mode the message type is always work write.

[0724] DMA Agent

[0725] In a system with multiple processors (e.g., a system on a chip with multiple network processors) that can send DMA transfer requests to one of multiple DMA controllers in the system, one challenge is knowing whether the DMA request can be serviced prior to issuing the request to a particular DMA controller. Otherwise, a DMA controller can be overloaded with DMA requests that it cannot service.

[0726] According to one beneficial aspect of the present invention, this challenge is met by providing a DMA agent module as a peripheral to each processor in the system. For the network processor (Vobla) described herein, for example, such a DMA agent may be implemented as one of the tightly linked compounds on the overall network processor. In other words, the DMA agent is a compound that shares the same ring interface as the overall network processor existing as a ring member.

[0727] According to this approach, the DMA agent operates to control the DMA transfer requests that are sent by the processor as follows:

[0728] (1) Each DMA controller has a dynamic pool of tokens that the DMA controllers allocate for use by the DMA agents linked to the various processors. In other words, each DMA controller has a pool of tokens that the DMA controller can distribute among the various DMA agents.

[0729] (2) Each valid token allows a DMA agent to send one DMA request to the DMA controller that owns the token. If there are no valid tokens, no DMA requests can be issued by the DMA agent and the processor will stall.

[0730] (3) The DMA agent periodically queries the DMA controllers for tokens whenever the number of valid tokens in the DMA agent's pool is less than a number prespecified by software. The maximum number set by software can change.

[0731] In sum, this approach avoids the scenario of the DMA agent issuing requests that cannot be serviced because the maximum number of requests that can be sent does not exceed the number of tokens held by the DMA agent.

[0732] The DMA agent module 530 (illustrated in FIG. 53) translates network processor DMA commands to ring messages used to initialize the DMA controller.

[0733] According to one embodiment of the invention, each network processor has one DMA agent. Each DMA agent has the ability to control each and every one of the DMA controllers that are available in the system, using the context table (e.g., in the implementation there are 3 DMA controllers, and each DMA agent can control up to 4 DMA controllers). According to one approach, the fourth DMA controller is provided for future system expansion.

[0734] The DMA agent is connected to the network processor agent interface and to the ring write interface. The DMA agent registers can be written by the host only via the write bus using ring messages. The context table is initialized by the PP once, and it is not changed during regular work. The token registers should be written only by the DMA controllers.

[0735] The Sources for Requests

[0736] The DMA agent can receive requests to initialize a DMA channel only via the agent interface using special network processor DMA commands. The DMA agent has a small request queue of two entries in order to minimize the need to stall the network processor if the DMA request could not be serviced (this could happen, for example, if there are no available tokens, or if the DMA agent is unable to send the messages to the DMA controller because the ring is busy).

[0737] Requests Priority. There are two priority levels for DMA requests in the DMA controller. The lower priority level is regular and the higher priority level is urgent. By default all DMA requests are regular. A DMA request can become urgent if the processor defines it as urgent. Requests that have urgent priority have the urg bit in the message set, and will get a higher priority in the DMA controller queue. The DMA agent ignores the urg bit (it sends it on to the DMA controller), and serves the requests in the order they arrive.

[0738] DMA Agent Context Table

[0739] The DMA agent context table maps a network processor DMA command to the actual request that will be sent to the DMA controller that was selected. The actual request defines the parameters for the current DMA transfer. The context table has four entries. The table entry to be used is determined by a two-bit pointer encoded out of the 4 MSB (most significant bits) of the DRAM address in the DMA command. (The reason that 4 bits are used is that the DRAM address space is divided into 16 parts and only 4 can be accessed by the DMA.) The entry allocation is hard coded. The context table can be written using write messages. The table should be initialized before starting any DMA access. The context table can be read using read messages.

[0740] ADDR=DMA_AGENT_BASE to DMA_AGENT_BASE+$F. Note: The maximum number of tokens which could be allocated for one channel is 15. Table 10 provides a description of the DMA context table.

TABLE 10
Field               Description
address[13:0]       The physical base address of the DMA controller to be used.
visitor[2:0]        The number of the request and mask bits to set for the current DMA transfer. This field is common to all the contexts.
max_tokens[3:0]     This field describes the maximum number of tokens that could be used by this DMA channel.

[0741] DMA agent token control. In order to manage DMA transfers from different sources with different contexts, a free token transfer based approach is used. According to this approach, the DMA agent has a pool of tokens. The service of a DMA request can start only if there are available valid tokens allocated for this DMA channel in the DMA agent. If there are valid tokens, the processing of the DMA request can start as previously described. If there are no available tokens to execute the DMA request, it will be registered in the DMA agent queue, and will wait for execution until the DMA agent gets a token from the DMA controller (note that if the DMA agent queue is full, the request will stall the network processor).

[0742] Token distribution is performed using messages. The DMA agent issues a request for a token to the DMA controller each time the number of valid tokens is less than the maximum allowed tokens (which is specified in the context table). The DMA controller sends the token back to the agent and marks this token as used in its token list. The DMA controller will free the token again when the DMA transfer is finished (i.e., before sending the message to the doorbell). If the DMA controller has no free tokens, then it sends the DMA agent an invalid token (i.e., all the bits in the token response are zero).

[0743] The DMA controller sends the DMA agent a valid token to the address of the token that was used (the DMA agent sends this address in the token request message). According to one embodiment of the invention, each DMA controller has a pool of a maximum of 16 tokens for each DMA channel. Of course, the number of tokens that is available for each DMA controller is flexible and could change according to system needs. The DMA agent token registers contain the token numbers that the DMA controllers allocated for use (the valid tokens are marked by setting the appropriate bit to one). The token registers can be written only by the DMA controllers. There are four token registers in the DMA agent. Table 11 illustrates the DMA agent channel[i] token register.

TABLE 11
ADDR=DMA_AGENT_BASE+$10 to DMA_AGENT_BASE+$1F
Bit positions:                    20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Fields (most significant first):  novt, req, token[15:0], 0, 0, 0

[0744] Table 12 provides a description of the DMA agent channel[i] token register.

TABLE 12
Field           Description
token[15:0]     This field describes which tokens are valid and can be used for DMA transfers:
                token[i] = 0: token is not valid
                token[i] = 1: token is valid
req             This field indicates that the DMA agent has issued a token replacement request but did not get a response:
                req = 0: no token request is pending
                req = 1: token request is pending
novt[3:0]       This field describes the number of valid tokens that are used by the DMA agent for this DMA channel.

[0745] When a DMA request is registered with the DMA agent, the DMA agent searches the appropriate token register to see if there are valid tokens. If there are valid tokens, the DMA agent uses one of them (e.g., the first one it finds) and marks that token as invalid. Then, the DMA agent starts the data transfer for channel initialization. The DMA agent also sends the DMA controller a message to replace the used token with a new one (this will be a work read type message). The indication that the DMA agent issued a token replacement request is made by setting the req bit of the relevant token register. If the DMA controller has a free token available, it will send it to the DMA agent, and the agent will replace the used token with the new one (i.e., the request bit is cleared). If the DMA controller does not have a free token available, it will send the DMA agent an invalid token (i.e., all the token bits are cleared and the req bit is cleared). The DMA agent issues a new token replacement request after a maximum of 4 cycles.
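The token exchange between a DMA agent and a DMA controller can be illustrated with a small behavioral model. All class and method names are hypothetical. The model assumes the token response is a one-hot bitmap (consistent with the statement that an invalid token is a response with all bits zero), and that the controller hands out its lowest-numbered free token first; neither detail is fixed by the text.

```python
class DmaController:
    """Owns a pool of tokens for one DMA channel."""

    def __init__(self, n_tokens: int = 16):
        self.free = list(range(n_tokens))

    def request_token(self) -> int:
        """Return a one-hot token bitmap, or 0 (invalid token) if none is free."""
        if not self.free:
            return 0
        return 1 << self.free.pop(0)

    def transfer_done(self, token_index: int) -> None:
        """Free a token once its DMA transfer has finished."""
        self.free.append(token_index)


class DmaAgentChannel:
    """Models one channel token register of the DMA agent."""

    def __init__(self, max_tokens: int):
        self.token_bits = 0        # token[15:0]: 1 = valid token held
        self.req = False           # token request pending
        self.max_tokens = max_tokens

    def valid_tokens(self) -> int:
        return bin(self.token_bits).count("1")

    def refill(self, controller: DmaController) -> bool:
        """Ask the controller for one token; True if a valid token arrived."""
        self.req = True
        response = controller.request_token()
        self.token_bits |= response
        self.req = False           # cleared for valid and invalid responses alike
        return response != 0

    def issue_dma(self):
        """Consume one valid token; None means the request must wait."""
        if self.token_bits == 0:
            return None
        lowest = self.token_bits & -self.token_bits
        self.token_bits &= ~lowest
        return lowest.bit_length() - 1


if __name__ == "__main__":
    ctrl = DmaController(n_tokens=2)
    chan = DmaAgentChannel(max_tokens=3)
    while chan.valid_tokens() < chan.max_tokens and chan.refill(ctrl):
        pass
    print("tokens held:", chan.valid_tokens())
```

In this model a request can only be issued while a token is held, so the number of outstanding requests at the controller is bounded by the size of its own pool.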

[0746] Address Error Control

[0747] The DMA agent has the ability to recognize if the DMA transfer is made to an illegal external address for each of the external DMA channels. When the DMA agent identifies such an access, it sends a special error message to the PP, informing the PP of the illegal access parameters.

[0748] Address error calculation is performed on the SDRAM address written by the network processor using the DMA command. The SDRAM address is split into two parts. The first part is bits [31:28] of the address and the second part is bits [27:20] of the address. The address error logic compares the first part of the SDRAM address to each one of the values (0, 0x2, 0x4, 0xf), which correspond to the 4 MS bits of the SDRAM areas. If a match is not found, an address error occurs and a special error message is generated by the DMA agent. If there is a match, the bits of the second part that are not excluded by a programmed mask are compared to zero. If the result is not equal to zero, an address error is generated and an error message is sent.

[0749] Address error mask register. Four 8-bit registers (one for each external channel) are used to store the mask values for address error computation. The mask value will be used to mask the comparison of some of the bits in the second part of the SDRAM address (bits 27-20). If a bit in the mask register is set, the corresponding SDRAM address bit will not be compared in the address error calculation. The reset value of the register is zero so as to enable the comparison of all 8 bits. Table 13 illustrates the DMA address error mask register[i].

TABLE 13
ADDR = DMA_AGENT_BASE+$30 -- DMA_AGENT_BASE+$3F
bits    7:0
field   mask[7:0]

[0750] Table 14 provides a description of the DMA address error mask register.

TABLE 14
field       description
mask[7:0]   This field describes which bits of the SDRAM address are masked during the process of address error calculation: mask[i] = 0, the corresponding SDRAM address bit is not masked; mask[i] = 1, the corresponding SDRAM address bit is masked.

[0751] (Note: There could be cases in which the DMA controller accesses an invalid external address that the address error logic does not detect. For example, this could happen if the base address of the transfer is in the real or normal range, but the address generated by the DMA during the transfer overflows this range.) (Note: If the network processor issues a DMA request to a channel that was not initialized [i.e., the corresponding context table entry was not initialized], an address error will occur.)

[0752] DMA Agent Input and Output Message Formats

[0753] The DMA agent input and output message format is now described. A general DMA agent message will have the format shown in Table 15.

TABLE 15
field           description
type[7:0]       The type field describes the incoming message type. The following types are valid. If the last bits are X they are ignored: type[7:0] = 00000XXX: idle; type[7:0] = 010WXLFI: work read; type[7:0] = 100FLZZZ: work write.
address[23:0]   The field describes the starting address space of the DMA agent. The DMA agent register address is from DMA_AGENT_BASE_ADD to DMA_AGENT_BASE_ADD+$1F.
data[31:0]      The data to be written to the registers.

[0754] The DMA agent output message format encoding is shown in Table 16 below.

TABLE 16
field           description
type[7:0]       The type field describes the outgoing message types. If the last bits are X they are ignored: type[7:0] = 00000XXX: idle; type[7:0] = 11111101: error; type[7:0] = 010WXLFI: work read; type[7:0] = 100FLZZZ: work write.
address[23:0]   The address of the destination. This address is a function of the base address written in the context table and the token number (see FIG. 33 for more details).
data[63:0]      This field contains the data for the DMA controller.

[0755] DMA Controller Message Data. According to one approach, the DMA agent will send the DMA controller two messages for each DMA transfer that was initiated by the network processor. The following tables describe the data part of each message. Table 17 illustrates the DMA controller message number 1.

TABLE 17 (bits 31..0 per row)
word 1   address[23:0]=destination_addr1   type=10000000
word 2   rsrvd   doorbell_address[23:0]
word 3   rsrvd   sram_address[23:0]

[0756] The first message that will be sent from the DMA agent to the DMA controller contains the return address for the DMA request doorbell and the internal SRAM address. The doorbell and the SRAM address are 24 bits wide:

[0757] doorbell address[23:0]: the 24 bits of the doorbell register to which the DMA controller should send the acknowledgement at the end of the transfer. The 6 LSB bits of this address are the task ID number at the time the DMA command was initiated.

[0758] SRAM address[23:0]: 24-bit address inside the internal SRAM (this is a full ring address).

[0759] Table 18 illustrates the DMA controller message number 2.

TABLE 18 (bits 31..0 per row)
word 1   address[23:0]=destination_addr1   type=10000000
word 2   dram_address[31:0]
word 3   rsrvd   count[7:0]   rsrvd   end   vst[2:0]   ack   dir   urg

[0760] The second message contains the external DRAM address and control information for the DMA transfer. The control information includes:

[0761] urg: 1 bit of urgent DMA request.

[0762] dir: 1 bit of the transfer direction (SRAM to DRAM or DRAM to SRAM). This information is found in the DMA command (dir=0 SRAM to DRAM; dir=1 DRAM to SRAM).

[0763] ack: 1 bit of doorbell acknowledgement enable. This bit tells the DMA whether it should send a doorbell at the end of the transfer. This information is found in the DMA command.

[0764] count[7:0]: 8 bits of the transfer size. This information comes from the DMA command.

[0765] vst[2:0]: 3 bits of visitor code. These bits indicate which request bit the DMA controller should set in the doorbell request register.

[0766] end: endian mode bit. The endian bit is the LSB bit of the DMA agent ID (end=0 big endian mode).

[0767] Token request and token reply messages. Tables 19 and 20 illustrate a token request and token reply message, respectively. The data part of the token request contains the address in the token register that should be written with a new token.

TABLE 19 (bits 31..0 per row)
word 1   type=11111101
word 2   sdram_address[31:0]
word 3   rsrvd   doorbell_address[23:0]

[0768] TABLE 20 (bits 31..0 per row)
word 1   address[23:0]=destination_addr3   type=01000000
word 2   rsrvd   token_register_address[23:0]   rsrvd

[0769] DMA agent calculating the message destination address. According to one approach, the messages that the DMA agent sends to the DMA controller are sent to three different destinations. The first two of these message destinations are:

[0770] DESTINATION_ADDRESS1={DMA_BASE_ADDRESS[13:0],0,1,token_number[3:0],0,0,0,0}

[0771] DESTINATION_ADDRESS2={DMA_BASE_ADDRESS[13:0],0,1, token_number[3:0],1,0,0,0}.

[0772] The destination address of the token request is:

[0773] DESTINATION_ADDRESS3={DMA_BASE_ADDRESS[13:0],10′b0}.
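
The three concatenations above each form a 24-bit address (14 + 1 + 1 + 4 + 4 bits for the first two, 14 + 10 bits for the third). A minimal Python sketch of the bit packing follows; the function name destination_addresses is hypothetical and the sketch only restates the concatenations given in the text.

```python
def destination_addresses(dma_base_address, token_number):
    """Return (address1, address2, address3) as 24-bit integers."""
    base = dma_base_address & 0x3FFF   # DMA_BASE_ADDRESS[13:0]
    token = token_number & 0xF         # token_number[3:0]
    # {base, 0, 1, token, low nibble}: base at bits 23:10, the fixed
    # bits "0,1" at bits 9:8, and the token number at bits 7:4.
    common = (base << 10) | (0 << 9) | (1 << 8) | (token << 4)
    address1 = common | 0b0000         # low nibble 0,0,0,0 (first message)
    address2 = common | 0b1000         # low nibble 1,0,0,0 (second message)
    address3 = base << 10              # {base, 10'b0} (token request)
    return address1, address2, address3
```

For example, a base value of 1 with token number 3 gives 0x000530, 0x000538 and 0x000400 for the three destinations.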

[0774] Error Message Format

[0775] Table 21 illustrates the error message format.

TABLE 21 (bits 31..0 per row)
word 1   token_register_address[23:0]   type=10000000
word 2   rsrvd   token[15:0]   rsrvd

[0776] (Note: The doorbell address is the address to which a doorbell should have been sent at the end of the DMA transfer if an address error had not occurred. This address contains the task ID information in the six LSB bits and the base address of the network processor from which the message error was sent in bits 23-6.)

[0777] Network Processor DMA Request Format

[0778] FIG. 54 describes how the DMA agent maps the data on the agent bus 576 to the DMA request 578 when the network processor initiates a DMA request.

[0779] The options as shown in FIG. 54 are as follows:

[0780] D: direction of data transfer (D=0 SRAM to DRAM; D=1 DRAM to SRAM).

[0781] NA: no acknowledgement at the end of the DMA transfer (NA=0 send acknowledgement; NA=1 do not send acknowledgement). Setting this bit also prevents the DMA mask bit in the doorbell agent from being set when the DMA agent sends the messages to the DMA controller.

[0782] A: set auto set bit in the doorbell mask register.

[0783] U: urgent DMA request.

[0784] M: modify address. Setting this bit enables the modification of the SRAM address and the DRAM address.

[0785] L: long address mode. Use 24 bits of RA as the SRAM internal address (in the regular mode [L=0] only 16 bits are used and the 8 MSB of the ring base address are appended to the 16 bits of RA to form the internal SRAM address).

[0786] The DMA agent will have two request entries for storing network processor DMA requests. If both entries are full and the network processor issues a new request, the network processor will be stalled until one of the requests is served.
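
The two-entry request storage and the resulting stall can be modeled in a few lines of Python. This is only a sketch: the class name RequestEntries and its methods are hypothetical, and first-in first-out service order is an assumption that the text does not state.

```python
from collections import deque


class RequestEntries:
    """Illustrative model of the DMA agent's request entries."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = deque()

    def submit(self, request):
        """Register a request; False means the issuer must stall."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(request)
        return True

    def serve_one(self):
        """Serve the oldest request (assumed order), freeing one entry."""
        return self.entries.popleft() if self.entries else None
```

A third submit call returns False until serve_one has freed an entry, which mirrors the network processor being stalled until one of the requests is served.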

[0787] Address Modification

[0788] One common operation in control code writing (such as for controlling the operation of the network processor of the instant invention) is the calculation of the destination address for read/write operations (such as read/write commands for the Vobla network processor). Destination addresses can be calculated, for example, according to several modes:

[0789] (1) Immediate addressing: the destination address is included in the command and no calculations are required.

[0790] (2) Register A+Register B: the destination address is the sum of the values of Register A and Register B.

[0791] (3) Register+Offset: the destination address is the sum of the value of Register A and an immediate offset value.

[0792] Often one of the arguments of an address calculation is used to point to the base address of a data structure and the other argument is used to point to an offset within the data structure. One difficulty is that if the same data structure is to be accessed multiple times with different offsets, or if different data structures are to be accessed using the same offset, the address calculation must be performed repeatedly (in the first case, computing a new offset each time; in the second case, computing a new base address each time). These redundant address calculations impose cycle costs and decrease overall efficiency.

[0793] Accordingly, one beneficial aspect of the present invention provides for adding a special address computation mode to the network processor data structure access commands. When activated, this special mode causes the destination address to be automatically computed using a base address, offset, and an address modifier.

[0794] According to one implementation, the destination address in this special mode is computed as:

[0795] DEST_ADDRESS=BASE_ADDRESS+OFFSET+MODIFIER

[0796] According to one embodiment of this approach, if agent option bit 9 (in one of the DMA commands) is set, the DMA agent will modify the value of the SRAM address and the DRAM address (that were written by the network processor) before sending the control message to the DMA controller. Address modification is accomplished in the following fashion. DRAM address bits 1,2,3 are OR'd with count bits 2,3,4 (respectively), and SRAM address bits 1,2,3 are OR'd with count bits 5,6,7 (respectively). When address modification is used, the DMA transfer size is limited to one of the four options listed in Table 22 below.

TABLE 22
count[1:0]   transfer size
00           2 bytes
01           4 bytes
10           8 bytes
11           16 bytes
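
The bit manipulation of this embodiment can be sketched in Python as follows. The function name modify_addresses is hypothetical; the sketch only restates the OR operations and the Table 22 size encoding described above.

```python
# Table 22: transfer size selected by count[1:0] when modification is used.
TRANSFER_SIZE_BYTES = {0b00: 2, 0b01: 4, 0b10: 8, 0b11: 16}


def modify_addresses(sram_address, dram_address, count):
    """Return (modified SRAM address, modified DRAM address, size in bytes)."""
    dram_bits = (count >> 2) & 0x7  # count bits 2,3,4
    sram_bits = (count >> 5) & 0x7  # count bits 5,6,7
    # OR the selected count bits into address bits 1,2,3 of each address.
    new_dram = dram_address | (dram_bits << 1)
    new_sram = sram_address | (sram_bits << 1)
    size = TRANSFER_SIZE_BYTES[count & 0x3]
    return new_sram, new_dram, size
```

For instance, with count = 0b10101110, the SRAM address 0x1000 becomes 0x100A, the DRAM address 0x2000 becomes 0x2006, and the transfer size is 8 bytes.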

[0797] An example of the special mode of addressing is instructive. Assume that a data structure located inside internal memory for a communication processor including the Vobla starts at address X. The size of the structure is SIZE bytes. Further assume that we want to copy a part of this structure, starting at address X+OFFSET1, to an external data structure which starts at address Y, writing it starting at address Y+OFFSET2. The X and Y base addresses are each stored in a register. According to the conventional approach, address computation is as follows:

[0798] ADD1=X+OFFSET1

[0799] ADD2=Y+OFFSET2

[0800] DMA ADD1, ADD2, SIZE

[0801] This conventional approach takes at least 3 cycles to execute and consumes 3 program memory locations. Using the special mode according to the invention, the code using address modification will be only this line:

[0802] DMA ADD[1], ADD[2], SIZE, OFFSET1, OFFSET2

[0803] This code takes 1 cycle to execute and consumes 1 program memory location, which saves program space and increases performance.

[0804] In accordance with one embodiment of the present invention, a method for performing address computation for a data structure address command in a communications processor is provided. The method comprises providing a library of read commands and write commands for a network processor in a rings-based architecture, including an option bit in the read commands and write commands for an address calculation modification mode, providing an agent module for forwarding read requests and write requests to a DMA controller in response to requests including an address issued by the network processor, and modifying the value of the address when the option bit is set before forwarding the read requests and write requests to the DMA controller. The method, in one embodiment, permits repeated accesses to an external data structure without recomputing the destination address in its entirety each time.

[0805] Modifying the value of an address, in one embodiment, comprises automatically computing a destination address using a base address, an offset, and an address modifier.

[0806] Further, modifying the value of an address, in one embodiment, allows computation of the destination address using a single read command or write command.

[0807] Doorbell Set Mask

[0808] The DMA agent is responsible for setting the DMA mask bit in the doorbell agent each time a DMA command is issued. The DMA mask bit will be set only if the NA bit is cleared (if acknowledgement is not needed for the DMA transfer there is no need to set the mask). If the auto set option bit is set and the NA bit is cleared, then two mask bits will be set at the same time in the doorbell. The index of the bit that should be set is determined according to the visitor bits in the context table (the auto set code is fixed).

DMA Agent Operation Scenario Examples

[0809] Example A. The network processor asks for write DMA access:

[0810] (1) The host has to initialize the DMA context table with all of the channel configurations. This should be done once for all possible configurations.

[0811] (2) The network processor issues a DMA command on the agent bus.

[0812] (3) The DMA agent registers the request in the request queue and extracts parameters.

[0813] (4) The DMA agent checks whether there is an available token from the DMA controller to start processing the request. If there is no token available, the request waits in the queue for execution until there is an available token. If the request queue is also full, the network processor will be stalled.

[0814] (5) Assuming there is an available token, the processing of the request begins. The DMA agent sends the DMA controller two messages containing all the parameters of the transfer.

[0815] (6) Since this is a write request, the DMA controller issues a multireader message. The multireader message requests a data transfer from the network processor memory to the DMA write buffer.

[0816] (7) When the DMA transfer is finished, the DMA controller sends a message to the doorbell.

[0817] Example B. The network processor asks for read DMA access:

[0818] (1) The host has to initialize the DMA context table with all the channel configurations. This should be done once for all the possible configurations.

[0819] (2) The network processor issues a DMA command on the agent bus.

[0820] (3) The DMA agent registers the request in the request entries and extracts parameters.

[0821] (4) The DMA agent checks whether there is an available token from the DMA controller to start processing the request. If there is no token available, the processing is stalled until a token becomes available.

[0822] (5) Assuming there is an available token, the processing of the request begins. The DMA agent sends the DMA controller two messages which contain all the parameters of the transfer.

[0823] (6) When the transfer is finished, the DMA controller sends a message to the doorbell. The DMA controller can now send a new token to the DMA agent.

[0824] Software/Hardware restrictions. According to one embodiment of the invention, only the DMA controller can write to the token register.

[0825] In accordance with one embodiment of the present invention, a communications processor implemented on at least one ring network is provided. The communications processor comprises a plurality of processors comprising ring members on the at least one ring network and a plurality of DMA controllers on the at least one ring network, the DMA controllers controlling servicing of DMA requests by the plurality of processors. The communications processor further comprises a plurality of DMA agents coupled to the plurality of processors, each DMA agent being part of a ring member including a processor, wherein each DMA agent is adapted to service processor DMA requests by determining whether a valid token exists from a pool of tokens reflecting available DMA controllers.

[0826] The tokens may be DMA controller specific tokens issued by the DMA controllers to the DMA agents to indicate when specific DMA controller access is available. Each time a processor issues a DMA request, in one embodiment, the associated DMA agent determines whether a valid token exists and, if a valid token exists, services that DMA request using the DMA controller associated with that token. The token can be marked as used or invalid when the token is used to service a DMA request. If no valid token exists, the DMA agent queues the DMA request until a valid token exists. The associated DMA agent can be adapted to automatically request a new valid token after an existing valid token is used to service the DMA request. Each DMA agent, in one embodiment, is adapted to request additional valid tokens when the number of valid tokens in the pool falls below a maximum number. The processors comprise, in one embodiment, a plurality of network processors and the at least one ring network comprises a plurality of ring networks.

[0827] In one embodiment, the pool of tokens is stored in a register written to by the DMA controllers.

[0828] The DMA agents can be adapted to service processor DMA requests by converting them to messages transmitted onto the at least one ring network. Likewise, the DMA controllers can distribute valid tokens by transmitting messages on the ring network that are received by specific DMA agents. Each DMA controller further may be adapted to maintain a list of tokens including those tokens that have been distributed as valid tokens.

[0829] The DMA controllers can be adapted to respond to requests from the DMA agents for additional tokens with an invalid token when no valid tokens are available. Each DMA controller can have a pool of up to, for example, 16 tokens for each DMA channel. The DMA controllers, in one embodiment, are capable of reading registers having the pools of tokens for the DMA agents by issuing read messages traveling on the at least one ring network.

[0830] CRC Agent (Snoop) 520

[0831] FIG. 55 is a schematic diagram of the CRC agent 520 according to one embodiment of the present invention. The Cyclic Redundancy Check (CRC) agent is a network processor compound module which implements logic to perform CRC calculations. The CRC agent supports different types of CRC calculations, such as CRC32, CRC16, CRC10, and so forth, for different data sizes (1 to 8 bytes). According to one approach, the CRC agent works in two major operational modes. The first mode is a snoop mode and the second mode is an on-demand mode. In the snoop mode the CRC agent snoops for messages in which the S bit is set. The CRC agent will detect those messages and will calculate the selected CRC on the message data. In the on-demand mode the network processor writes data to the CRC agent, and the CRC agent uses this data for its calculations.

[0832] The network processor can write the CRC registers via the agent bus using AGENTW/I commands. The network processor can read the CRC residue via the agent bus using an AGENTR command. The CRC agent can stall the network processor if the network processor reads the CRC results and the results are not yet ready. The CRC module may also be able to generate a 32-bit random number.

[0833] Features of the CRC Agent

[0834] Performs CRC calculations of: CRC32 for ATM cell processing AAL5; CRC10 for OAM ATM cells, which requires support for calculating the CRC10 on 22-bit data of the last transmit word and merging the 10-bit CRC into the 22-bit data to generate the last 32-bit word to be transmitted by the multireader; BIP 16 for ATM performance monitoring (this process is done in parallel with the CRC calculation); CRC5 for ATM cell processing AAL2 (on-demand mode only), which requires calculating CRC5 for 19-bit data for CRC generation (transmit) (unless CRC5 is initialized to 0) and calculating CRC5 for 24-bit data for the CRC check (receive); and checksum for IP streams. The checksum is calculated on 32-bit (or 64-bit) data. The convergence to 16-bit data will be performed by software.

[0835] The CRC Agent has two modes of operation:

[0836] On-demand mode, performed for any data transferred (e.g., CRC5, hashing function); and snoop mode, performed for a continuous data sequence transferred from/to the serial interfaces.

[0837] The CRC agent can be adapted to calculate the CRC for 8, 16, 24 or 32 bits of data in a single cycle. If CRC is enabled for snooping, a network processor agent read instruction from a CRC residue register stalls until the last indication arrives with the last data word. Special control enables the CRC residue to be calculated on partial data (e.g., 22 bits in CRC10, or 0 bits in CRC32); then the CRC residue is combined with the partial data to form the 32-bit last word of the frame, and this is exposed to the multireader block for transmission. In CRC5, the CRC module is capable of calculating the 5-bit CRC out of 19-bit data for transmit, or out of 24-bit data for the CRC check in receive (on-demand mode).
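
The CRC10 partial-data case can be illustrated with a bit-serial Python sketch. Several points are assumptions rather than statements of the source: the generator x^10 + x^9 + x^5 + x^4 + x + 1 is the standard ATM OAM CRC-10 polynomial, the data is processed MSB first, and the 22 data bits are placed in the upper bits of the last word with the 10-bit CRC in the lower bits. The function names are hypothetical, and the hardware described here processes up to 32 bits per cycle rather than one bit at a time.

```python
CRC10_POLY = 0x233  # x^10 + x^9 + x^5 + x^4 + x + 1, leading term implied


def crc_msb_first(value, nbits, poly, width, residue=0):
    """Feed the nbits low bits of value, MSB first, into a CRC register."""
    mask = (1 << width) - 1
    reg = residue & mask
    for i in range(nbits - 1, -1, -1):
        feedback = ((reg >> (width - 1)) & 1) ^ ((value >> i) & 1)
        reg = (reg << 1) & mask
        if feedback:
            reg ^= poly
    return reg


def build_last_word_crc10(data22, residue=0):
    """Merge 22 data bits with their CRC10 into one 32-bit word."""
    data22 &= 0x3FFFFF
    crc = crc_msb_first(data22, 22, CRC10_POLY, 10, residue)
    return (data22 << 10) | crc
```

A receiver that runs the same calculation over all 32 bits of the merged word, starting from the same residue, obtains zero when the word is intact, which is the basis of the check direction.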

[0838] The CRC agent, in one embodiment, is adapted to interface to: the transmit bus, for snooping TX data and calculating CRC; and the agent bus, for configuration, on-demand activation and read/write of the residue.

[0839] Network Processor Writing to the CRC.

[0840] The network processor 514 can write to the CRC agent 520 using AGENTW commands. The mapping of the AGENT command 590 to CRC data 592 is described in FIG. 56.

[0841] The options include:

[0842] TYPE[2:0]: 3-bit CRC type. The types are: 000 = CRC 32; 001 = CRC 10; 010 = CRC 5; 011 = checksum; 100 = CRC16; 111 = BIP16 (only for writing the BIP16 residue register).

[0843] The BIP 16 machine works in parallel to all of those machines.

[0844] SIZE[2:0]: the number of valid bytes in the data (1 to 8) starting at the LSB of RA (size=0 means 8 valid bytes in the message).

[0845] G: this bit indicates if the CRC agent works in the generate CRC or the check CRC mode.

[0846] S: the operation mode of the CRC module. If S=1 the CRC works in the snoop mode. If S=0 the CRC works in the on-demand mode. When working in on-demand mode, the data for the CRC calculation and the residue are written by the network processor. Since the data in the memory is stored in big endian format, and the data in the network processor register file is stored in little endian format, the CRC module may perform some manipulation of the message data before the CRC calculation (especially if the data size is not 32 or 64 bits).

[0847] O: overwrite residue. If O=1 the new residue from RB/imm8 is used for the CRC calculation. If O=0 then the current value of the residue register is used.

[0848] CRC Residue Registers.

[0849] The CRC module contains two residue registers. The first residue register is a 64-bit register containing the residue for the CRC and checksum calculations. The second residue register is a 32-bit register containing the residue for the BIP 16 calculation.

[0850] Reading CRC Registers by the Network Processor

[0851] The network processor can read the results of the CRC calculations using the AGENTR command. The result of the CRC machine that will be read is determined according to the operational mode that was selected.

[0852] The BIP16 machine calculation result will have a different register that could be read by the network processor (i.e., the two residue registers have two different addresses). If the network processor reads one of the CRC registers and the result is not ready, the network processor will be stalled.

[0853] In snoop mode, the CRC calculation is considered to be complete after all the data has arrived (last indication in the message). In on-demand mode the result of the CRC calculation will be available for reading one cycle after it was written if the data size is smaller than four bytes, and two cycles after it was written for larger data sizes.

[0854] CRC Agent Operation Scenarios, Examples

[0855] Example A. Calculating CRC in on-demand mode:

[0856] (1) The network processor writes to the CRC agent using the AGENTW command. The data that is written to the CRC agent contains: the CRC type; the data on which the CRC is to be calculated; the size of the data (number of valid bytes); and a new residue if the current residue is to be overwritten. The operational mode is set to work in the on-demand mode, and in the CRC 5 mode the G bit should also be written.

[0857] (2) One or two cycles after the data was written to the CRC (depending on the number of valid bytes in the data, the CRC machine can calculate CRC on 32 bits in one cycle), the network processor can read the CRC result.

[0858] Example B. Calculating CRC on transmit data (multireader data out):

[0859] The CRC machine can calculate the CRC of the transmit data by snooping the S and L bits of the multireader output messages. The network processor initializes the CRC agent in the following manner:

[0860] (1) CRC type.

[0861] (2) A new residue if the current residue is to be overwritten. The data and the data size of the residue will be taken from the message data and type parts, respectively (the data part of the agent bus is ignored in the snoop mode).

[0862] (3) The operational mode must be set to work in the snoop mode, selecting the transmit data bus as a source for the data.

[0863] (4) One or two cycles after the last data has arrived at the CRC (depending on the number of valid bytes in the data, the CRC machine can calculate the CRC on 32 bits in one cycle) the network processor can read the CRC result.

[0864] Example C. Calculating CRC of receive data: The CRC machine can calculate the CRC of the receive data by snooping the S and L bits of the agent write bus messages. The network processor initializes the CRC agent as follows:

[0865] (1) CRC type.

[0866] (2) A new residue if the current residue is to be overwritten. The data and the data size will be taken from the message data and type parts, respectively (the data of the agent bus is ignored in the snoop mode).

[0867] (3) The operational mode must be set to work in the snoop mode.

[0868] (4) One or two cycles after the last data has arrived at the CRC (depending on the number of valid bytes in the data, the CRC machine can calculate CRC on 32 bits in one cycle) the network processor can read the CRC result.

[0869] Timer Agent 526

[0870] Referring now to FIG. 57, an exemplary embodiment of the timer agent 526 is illustrated in accordance with one embodiment of the present invention. The timer module is designed to allow the assignment of time stamps to various events within network processor tasks. According to one approach, the timer contains a 32-bit count-up free running counter. The counter counts at a frequency which could be calculated using the following formula.

F(counter)=[F(clock)]/[2*(prescale value+1)]

[0871] Usually the counter frequency will be set to 1 MHz (which corresponds to a 1 microsecond period). The prescale counter is a 10-bit down-counter, which divides its input clock frequency by the prescale value. If the prescale value is equal to zero the prescaler will be bypassed.
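
As a worked illustration of the counter frequency formula, the following Python sketch computes the counter frequency for a given prescale value and the prescale value needed for a target frequency. The function names are hypothetical, and the 100 MHz input clock used in the example is an assumed figure, not one given in the text.

```python
def counter_frequency(clock_hz, prescale):
    """F(counter) = F(clock) / (2 * (prescale value + 1))."""
    if not 0 <= prescale <= 0x3FF:
        raise ValueError("prescale value must fit in 10 bits")
    return clock_hz / (2 * (prescale + 1))


def prescale_for(clock_hz, target_hz):
    """Return the prescale value whose counter frequency is nearest target_hz."""
    prescale = round(clock_hz / (2 * target_hz)) - 1
    if not 0 <= prescale <= 0x3FF:
        raise ValueError("target frequency not reachable with a 10-bit prescaler")
    return prescale
```

Under the assumed 100 MHz clock, a prescale value of 49 gives 100 MHz / (2 * 50) = 1 MHz, the usual setting mentioned above. A prescale value of zero leaves only the divide-by-2 stage, consistent with the prescaler being bypassed.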

[0872] The time stamp value could be read by the network processor from the time stamp register using the agent interface.

[0873] Network Processor Writes to the Timer

[0874] The network processor can write to the timer using the AGENTW/AGENTWI commands. In order to enable timer operation only two values are required. The first value is the control information which resides in register RB or the imm8 value (according to one approach, only one bit is used). The second value is the prescale value which determines the counting frequency of the timer. The prescale value is taken from the 10 LSB of RA. The value of RAP is ignored. FIG. 58 illustrates the mapping of the AGENTW command 602 to the timer data 604.

[0875] Timer Control Register

[0876] The timer control register is used to store the prescale value and to enable/disable the timer count operation. The timer control register is written using AGENTW/I commands and read using the AGENTR command. Tables 23 and 24 show the timer control register and a description of the timer control register, respectively.

TABLE 23
bits    16    15:10   9:0
field   ten   RSRVD   tps[9:0]
reset = 0

[0877] TABLE 24
field      description
ten        Timer enable bit. This bit enables the timer operation.
tps[9:0]   This field describes the division factor of the clock after it was divided by 2.

[0878] Time Stamp Register

[0879] The time stamp register contains the value of the timer counter at the time of an agent read operation. The register is read by the network processor using the AGENTR command. Table 25 illustrates the time stamp register, and Table 26 provides the time stamp register description.

TABLE 25
bits    31:0
field   tsv[31:0]
reset = 0

[0880] TABLE 26
field       description
tsv[31:0]   Time stamp value. This is the value of the timer counter at the time of the read operation.

[0881] Doorbell Agent 516

[0882] FIG. 59 is a schematic diagram of the doorbell agent 516 according to one embodiment of the invention. The doorbell agent is the scheduler module which handles the execution sequence of the tasks. The doorbell is connected to the network processor agent interface and to the ring write interface. The doorbell registers can be accessed by the network processor using one of the special AGENT commands, or via the write bus using ring messages (e.g., by the serials and the host). All the possible service requests from the different sources go into the doorbell agent via the write bus. When the doorbell detects a request message it registers the request in the doorbell logic.

[0883] According to one embodiment of the invention, the doorbell agent can handle requests of up to 64 different tasks. The doorbell chooses the highest priority pending request (out of all the un-masked tasks), and sends its task ID to the network processor as the next task ID. The network processor sends back to the doorbell the current task ID that it is executing. The network processor uses the task ID information to perform the prefetch, bump and task switching, as previously described.

[0884] The Sources for Requests

[0885] The sources for doorbell requests include: Regular serial, timer, or software request (e.g., a message from another task): This request indicates that a data fragment has been received in the RX FIFO, that there is room to write more data into the TX FIFO for transmission, or that a timer has finished its count.

[0886] DMA request: The DMA has finished its data transfer.

[0887] Self-request: When a task yields itself (i.e., when the task execution time exceeds the maximum allowed execution time), the software can resume its execution by setting the self-request bit. The starting point of the task will depend on what is written in the EP (entry point) register. The EP register can be updated by hardware or by software.

[0888] According to one approach, every request bit has its own mask bit (except the self-request). When the mask bit is cleared the request is ignored and the task cannot trigger task switching. The self-request is the only request bit that cannot be masked. When a task enters execution, its corresponding request bit and all the mask bits (except the auto set [aset] and the urgent status bits [urg]) are automatically cleared. This is done to avoid serving the same request more than once.

[0889] Selecting Next Task for Execution

[0890] According to one approach, the algorithm for selecting the next task for execution is as follows. The tasks which participate in the selection of the next task for execution are those whose corresponding mask bit in the Task Global Mask Register (TGMR) is cleared. Tasks which participate in the selection of the next task and have unmasked requests are divided into four groups and served in the following order:

[0891] (1) Highest priority group includes urgent requests of task numbers 0-31.

[0892] (2) Second priority group includes regular requests of task numbers 0-31.

[0893] (3) Third priority group includes urgent requests of task numbers 32-63.

[0894] (4) Lowest priority group includes regular requests of task numbers 32-63.

[0895] Within each group the requests are served according to the task number. Lower task number requests are served before higher task number requests.
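The selection order described above can be sketched in software as follows. This is only an illustrative model, not the hardware implementation: the function name and the representation of the TGMR and of the pending requests as 64 bit integers (bit i corresponding to task i) are assumptions made for the example.

```python
from typing import Optional

NUM_TASKS = 64


def select_next_task(tgmr: int, regular: int, urgent: int) -> Optional[int]:
    """Return the task ID to serve next, or None if nothing is pending.

    tgmr    -- Task Global Mask Register; a task participates when its bit is cleared
    regular -- bit i set if task i has an unmasked regular request (assumed encoding)
    urgent  -- bit i set if task i has an unmasked urgent request (assumed encoding)
    """
    participating = ~tgmr & ((1 << NUM_TASKS) - 1)
    # The four groups, in the order in which they are served.
    groups = (
        (urgent, range(0, 32)),
        (regular, range(0, 32)),
        (urgent, range(32, 64)),
        (regular, range(32, 64)),
    )
    for pending, task_ids in groups:
        candidates = pending & participating
        # Within a group, the lower task number is served first.
        for task in task_ids:
            if (candidates >> task) & 1:
                return task
    return None
```

For example, with a regular request pending for task 3 and an urgent request pending for task 40, the sketch returns task 3, because regular requests of tasks 0-31 are served before urgent requests of tasks 32-63.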

[0896] Accessing the Doorbell Registers from the Network Processor

[0897] The network processor can access the doorbell registers via the agent interface using one of the special AGENT commands.

[0898] The network processor can directly modify only the register bits of the current task (the request, mask, counter bits value), or the global mask register (TGMR). Modifying other task register bits can be done via the ring write bus by sending a message from the message sender agent to the doorbell.

[0899] The data 612 for modifying the mask, request and the counter bits 614 of the current task is encoded in the RB/imm8 part of the agent command as illustrated in FIG. 60. The doorbell logic decodes the 8 LSB of RB/imm8 and sets the appropriate bits in the current task register, counter, urgent or TGMR.

[0900] Setting a request or mask bit is performed by writing 5 bits of the command index in the RB/imm8 part of the AGENT command and then 3 bits of the index or the request bit that is to be set, and then 3 bits of the mask bit that is to be set. Note: Only one mask bit at a time can be set by the network processor using a single agent command (if other mask bits were set they will be cleared by the agent write command, except for the auto set bit. Writing the auto set bit will not clear other mask bits). Writing to the request bits will not clear other request bits if they were already set. If the index value is zero the write to that part of the register is ignored.

[0901] Table 27 describes the decoding of the RB/imm8 part of the message and the operations that take place.
TABLE 27
operation | RB/imm8 value | index | mask | request
Write task register mask and request bits | (0,0,0,0,0,mask_bit_index[2:0]) (0,0,0,0,1,request_bit_index[2:0]) | 000 | don't change mask bits | don't change request bits
 | | 001 | set the aset bit, don't change other mask bits | set the preq bit (self request). Other request bits are not changed.
 | | 011 | set the mdma bit, clear all other mask bits | decrement the DMA request counter by 1
 | | 100 | set the mpreq bit, clear all other mask bits | set the preq bit. Other request bits are not changed.
write request counter | (1,0,0,0,0,counter value) | | |
write TGMR | (0,1,0,0,0,0,0,0) | | |
write urgent | (1,1,0,0,0,0,0,urgent value) | | |

[0902] The options in the agent command that are used by the doorbellare:

[0903] CM—Clear mask. Setting this bit will clear all the bits of the current task mask bits (including the auto set bit).

[0904] CR—Clear request. Setting this bit will clear all the bits of the current task request bits.

[0905] SG—Set global. This bit determines whether the task global mask register (TGMR) bits will be set or cleared according to the data in the RA/RAP part of the agent command. If the SG bit is set then the TGMR bits will be set at the locations corresponding to the set bits in RA,RA+1 data.

[0906] Clearing the mask register bits is accomplished by writing 1 to the clear mask (CM) bit in the command. If the CM option is used at the same time another mask bit is written, the set operation overwrites the CM operation.

[0907] Clearing the request bits is done by writing 1 to the clear requests (CR) bit in the command. If clearing the requests happens at the same time the request is set from the ring then the request will be set (set will overwrite the reset).

[0908] The peripheral request mask bit (mpreq) could also be set by the network processor when there is a YIELD command and the set default mask option is used. Other request bits will be cleared.

[0909] The network processor can initialize the DMA requests counter of the current task by setting the RB/imm8 part of the agent bus to {1,0,0,0,0,count_value[2:0]}.

[0910] Writing the TGMR is done by setting the RB/imm8 part of the agent bus to {0,1,0,0,0,0,0,0} (see also the discussion on the Task Global Mask Register (TGMR) below).

[0911] Writing the current task priority bit is done by setting the RB/imm8 part of the agent bus to {1,1,0,0,0,0,0,urgent_value} (see also the discussion on Task Priority Control below).
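As an illustration of the RB/imm8 encodings listed above, the following sketch builds the 8 bit values in software. It assumes that the bit lists are written most significant bit first, which agrees with the value 01000000 given for the TGMR write; the helper names are hypothetical and are not part of the agent command set.

```python
def _check_3bit(value: int) -> int:
    """Validate that a value fits in a 3 bit field."""
    if not 0 <= value <= 7:
        raise ValueError("field must fit in 3 bits")
    return value


def imm8_write_mask(mask_bit_index: int) -> int:
    """{0,0,0,0,0,mask_bit_index[2:0]}: write the current task mask bits."""
    return _check_3bit(mask_bit_index)


def imm8_write_request(request_bit_index: int) -> int:
    """{0,0,0,0,1,request_bit_index[2:0]}: write the current task request bits."""
    return 0x08 | _check_3bit(request_bit_index)


def imm8_write_counter(count_value: int) -> int:
    """{1,0,0,0,0,count_value[2:0]}: initialize the DMA requests counter."""
    return 0x80 | _check_3bit(count_value)


# {0,1,0,0,0,0,0,0}: write the TGMR (the data itself comes from RA,RAP).
IMM8_WRITE_TGMR = 0x40


def imm8_write_urgent(urgent_value: int) -> int:
    """{1,1,0,0,0,0,0,urgent_value}: write the current task priority bit."""
    if urgent_value not in (0, 1):
        raise ValueError("urgent_value must be 0 or 1")
    return 0xC0 | urgent_value
```

Under these assumptions, initializing the counter to 5 yields the value $85, and setting the urgent bit yields $C1.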

[0912] Reading Doorbell Registers from the Network Processor.

[0913] The doorbell bits of the current task (i.e., the request bits, mask bits and the counter value) are reflected in the task SPR register of the network processor. The TGMR could be read using the agent read command (AGENTR).

[0914] Setting the Doorbell Mask Bits from the DMA Agent

[0915] Another option for setting the DMA mask (mdma) bit and the auto set (aset) bit is by using the network processor DMA commands. The DMA commands have an option to set the DMA mask bit and the auto set bit.

[0916] When the DMA agent detects a DMA command, it can set the appropriate mask bit in the doorbell using the DMA context table (the context table stores the information as to which bit to set). The mask setting will be done if the NA bit in the DMA command is cleared. The auto set bit will be set if the A option bit in the DMA command is set.

[0917] Setting the Doorbell Requests Bits from the Ring

[0918] The doorbell registers could be accessed by the peripherals, the network processor and the host using ring messages. Every time a peripheral wants to set a request bit, the peripheral sends a write message with a destination address of the doorbell entry it wants to set. The doorbell will set the appropriate request bit in the doorbell registers according to the content in the data field of the message.

[0919] If a request bit and the corresponding mask bit are set, a valid request is sent to the doorbell priority logic. The mask and auto set bits can not be modified from the ring write bus. Table 28 shows the encoding for the input message format. The doorbell responds to messages of the types listed in Table 28.
TABLE 28
field: description
type[7:0]: The type field describes the outgoing message types. (If the last bits are X they are ignored.)
  type[7:0] = 00000XXX: idle
  type[7:0] = 100FLZZZ: work write
address[23:0]: The address of the doorbell register. The doorbell register space ranges from DOORBELL_BASE_ADD to DOORBELL_BASE_ADD + $3F.
data[2:0]: The value of the doorbell bit that should be set.
  data[2:0] = 000: do not change any request bit.
  data[2:0] = 001: set self request (sreq) bit.
  data[2:0] = 011: decrement request counter by 1.
  data[2:0] = 100: set peripheral request (preq) bit.
P: Doorbell request priority status. This bit reflects the current status of the doorbell request.
  P = 0: Current request status is normal.
  P = 1: Current request status is urgent.
O: Overwrite task current priority status with doorbell request status.
  O = 0: current priority status is not overwritten.
  O = 1: current priority status is overwritten.

[0920] Doorbell Register File Format

[0921] According to one embodiment, the doorbell register file contains 64 registers. Thus, each possible task has its own doorbell register. The doorbell registers have the format set forth in Table 29.
TABLE 29
bits: 31-21 | 20 | 19 | 18-16 | 15-12 | 11 | 10 | 8 | 7-3 | 2 | 1 | 0
field: rsrvd | urg | rsrvd | count[2:0] | rsrvd | preq | dma | sreq | rsrvd | mpreq | mdma | aset
reset: 0 | 0 | 0 | 000 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0

[0922] ADDR=DOORBELL_BASE to DOORBELL_BASE+$3F (Note: Current task register bits are reflected in the network processor status register.) (Note: All of the request and mask bits [not including the auto set bit] are automatically cleared when the task enters execution.) Table 30 provides a description of the doorbell register according to an embodiment of the invention.
TABLE 30
field: description
urg: The urg (urgent) bit is used to allow the software to control the priority level of a task (as opposed to the urgent request status, which is generated automatically and can not be controlled by software). If the bit is set the task has high priority. This bit is written only by the Vobla.
count[2:0]: These bits represent the number of DMA requests that should be acknowledged. Every DMA activation that requires acknowledgement at the end of the DMA transfer will cause the DMA agent to increment the counter value by 1. Every acknowledgement that is written to the dma bit in the doorbell register decrements the counter value by 1. If the counter value is equal to zero and the current task was yielded, the dma bit will be set (only if the counter was incremented at least once during the current task). If the dma mask (mdma) bit is set then a task switch will be triggered. These bits can be written by the Vobla using the AGENT command.
preq: Regular peripheral request. preq=0: no regular peripheral request is pending. preq=1: regular peripheral request is pending. This bit can be set from the write bus or by the Vobla, and can be cleared by the Vobla. In case the bit is set and cleared at the same time, the set will overwrite the reset.
dma: This bit indicates that the request counter has decremented to zero after a valid Vobla yield. dma=0: the request counter did not decrement to zero. dma=1: the request counter has decremented to zero. This bit can be set by the doorbell logic. Writing to this bit from the write bus will decrement the request counter value by 1. This bit can be cleared by the Vobla. In case the bit is set and cleared at the same time, the set will overwrite the reset.
sreq: Self-request bit. This request is non-maskable. sreq=0: self-request is not pending. sreq=1: self-request is pending. This bit can be set from the write bus or the Vobla, and can be cleared by the Vobla. In case the bit is set and cleared at the same time, the set will overwrite the reset.
mpreq: Peripheral request mask bit. mpreq=0: peripheral request is masked and can not trigger task switch. mpreq=1: peripheral request is not masked, and will trigger task switch when it is the highest priority pending request. This bit can be set by the Vobla and the DMA agent and can be cleared by the Vobla. In case the bit is set and cleared at the same time, the set will overwrite the reset.
mdma: DMA request mask bit. mdma=0: DMA request bit is masked and can not trigger task switch. mdma=1: DMA request bit is not masked and will trigger task switch when it is the highest priority pending request. This bit can be set by the Vobla and DMA agent, and can be cleared by the Vobla. In case the bit is set and cleared at the same time, the set will overwrite the reset.
aset: Automatically sets the mask bits to their default value after serving the current request. aset=0: do not set the mask bits to their default after serving the current request. aset=1: set the mask bits to their default after serving the current request. This bit can be set by the Vobla and DMA agent and can be cleared by the Vobla. In case the bit is set and cleared at the same time, the set will overwrite the reset.
rsrvd: Reserved bits are read as zero and can not be written.

[0924] Task Global Mask Register (TGMR). The task global mask register (TGMR) is a 64 bit register (one bit per task), which could be accessed by the network processor using the AGENT commands. The TGMR is used to determine which tasks are taken into consideration when calculating the next task for execution. Every set bit will prevent the corresponding task from being selected as the next task for execution, even if that task has valid requests to serve (at least one request bit and its corresponding mask bit are set).

[0925] Writing the TGMR is done in the following way according to one embodiment. The AGENT write command must contain the value 01000000 in the LSB of RB or the imm8 field. Based on the value of the SG option bit and the value of RA,RAP, the TGMR bits are set or cleared. Only bits which have the corresponding RA,RAP bits set are affected.

[0926] The TGMR could be read using AGENTR commands. The 32 LSB of TGMR are located at address 0 of the doorbell, and the 32 MSB are located at address 1. The user can read all 64 bits using the read double option of the AGENTR command. If only 32 bits are read, the other part of the data will be zeroed.
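The TGMR write and read behavior can be modeled with a short sketch. The function names are hypothetical, and the mapping of RA to TGMR bits 31:0 and RAP to bits 63:32 is an assumption made for illustration; the text above states only that the bits selected by RA,RAP are the ones affected.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def write_tgmr(tgmr: int, ra: int, rap: int, sg: bool) -> int:
    """Return the new TGMR value after an AGENT write with RB/imm8 = 01000000.

    Only the bits selected by RA,RAP are affected: they are set when the SG
    option is set and cleared otherwise. RA is assumed to carry bits 31:0
    and RAP bits 63:32.
    """
    selected = ((rap & MASK32) << 32) | (ra & MASK32)
    if sg:
        return (tgmr | selected) & MASK64
    return tgmr & ~selected & MASK64


def read_tgmr_word(tgmr: int, address: int) -> int:
    """Model a 32 bit AGENTR read: address 0 holds the 32 LSB, address 1 the 32 MSB."""
    if address not in (0, 1):
        raise ValueError("TGMR occupies doorbell addresses 0 and 1")
    return (tgmr >> (32 * address)) & MASK32
```

For instance, a write with SG set, RA = $5 and RAP = $1 masks tasks 0, 2 and 32 without touching any other TGMR bit.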

[0927] Handling DMA Requests

[0928] In a system with multiple processors capable of running multiple tasks that can issue DMA requests to the multiple DMA controllers, one challenge is knowing at certain points in time whether all of the DMA requests issued by a specific task running on a processor are finished. The challenge can be significant because DMA requests may be issued by different tasks running on a processor to different DMA controllers. Also, the DMA requests may finish out of the order in which they were issued.

[0929] According to one approach, the invention provides that a DMA agent (previously discussed) be associated with each of the processors in the system. The role, in this instance, of the DMA agent is to control the DMA transfer requests made by the associated processor. For each DMA request issued by the DMA agent, the DMA agent sends an indication to a book-keeping unit. In one embodiment, the book-keeping unit is a request counter in the doorbell task register for each processor. The book-keeping unit receives this indication and increments the request counter. Because the preferred system performs multi-tasking, the request counter may include a separate entry (or separate request counter) for each task performed by the processor.

[0930] When the target DMA controller completes the DMA transfer, the DMA controller issues a decrement counter message to the book-keeping unit. The relevant entry (or relevant request counter) is then decremented by one. When the relevant entry (or relevant request counter) reaches zero, the system knows that all DMA transfers for that task have been completed.

[0931] Therefore, according to one embodiment of the invention, during normal task execution, there is a possibility that more than one DMA transfer is initiated. Each one of them could finish its data transfer at any given time, perhaps not in the order in which they were initiated. Typically it is preferable to trigger a valid request only after all DMA transfers from all the different DMA channels within a task have finished. In order to implement this requirement each doorbell task register has its own request counter.

[0932] The request counter is incremented every time it gets an increment counter indication. The increment counter indication is an option in the network processor DMA commands (this is the NA bit). Every time a DMA command is issued and the NA bit is cleared, the counter is incremented by 1.

[0933] When the DMA controller or peripheral sends its acknowledgement back to the doorbell by writing to the DMA bit in the request register, the counter is decremented by 1. When the counter reaches zero and a valid YIELD was executed by the network processor, the DMA bit in the doorbell register will be set. If the mdma bit is also set, a task switch request will be issued.
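The per-task book-keeping described above can be sketched as a small software model. The class and method names are hypothetical; the model tracks only the request counter, the dma request bit and the mdma mask bit of a single doorbell task register, and it assumes an acknowledgement never arrives when the counter is already zero.

```python
class DoorbellTaskDma:
    """Illustrative model of the DMA book-keeping for one doorbell task register."""

    def __init__(self, mdma: bool = False) -> None:
        self.count = 0            # outstanding DMA requests awaiting acknowledgement
        self.incremented = False  # counter was incremented during the current task
        self.yielded = False      # a valid YIELD was executed
        self.dma = False          # dma request bit
        self.mdma = mdma          # dma request mask bit

    def dma_command(self, na: bool) -> None:
        """A DMA command was issued; the counter increments when NA is cleared."""
        if not na:
            self.count += 1
            self.incremented = True

    def dma_ack(self) -> None:
        """An acknowledgement was written to the dma bit from the write bus."""
        if self.count > 0:
            self.count -= 1
        self._update()

    def yield_task(self) -> None:
        """The network processor executed a valid YIELD for this task."""
        self.yielded = True
        self._update()

    def _update(self) -> None:
        # The dma bit is set once the counter is zero after a yield, and only
        # if the counter was incremented at least once during the task.
        if self.count == 0 and self.yielded and self.incremented:
            self.dma = True

    def task_switch_requested(self) -> bool:
        """A valid request exists when both the dma bit and its mask are set."""
        return self.dma and self.mdma
```

In this model, two DMA commands followed by one acknowledgement and a yield leave the dma bit cleared; the bit is set, and a task switch is requested, only when the second acknowledgement arrives.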

[0934] In accordance with one embodiment of the present invention, a communications processor implemented on at least one ring network is provided. The communications processor comprises a plurality of processors comprising ring members on the at least one ring network, a plurality of DMA controllers on the at least one ring network, the DMA controllers controlling servicing of DMA requests by the plurality of processors, and a plurality of DMA agents coupled to the plurality of processors. Furthermore, each DMA agent is part of a ring member including a processor, and each DMA agent is adapted to issue an indicator to a request counter coupled to the DMA agent for each DMA request issued by the DMA agent to a DMA controller, thereby allowing each DMA agent to maintain a count of the outstanding DMA requests that have been issued on behalf of the processor associated with the DMA agent. In one embodiment, the request counter maintains a separate count for each task being executed by the processor, wherein the request counter is contained in a doorbell register supporting up to 64 tasks.

[0935] Upon satisfaction of the DMA request by a target DMA controller, the target DMA controller can be adapted to issue a response that causes the request counter to decrement the count by one. In this case, the DMA requests issued by the DMA agent to the DMA controller and the response issued by the target DMA controller can be transmitted as messages on the at least one ring network. Also, upon the counter returning to zero the processor can be enabled to switch to other tasks because all DMA requests for a given task have been satisfied. In this case a new DMA request for a different task can be deferred until the counter has returned to zero for the given task.

[0936] In accordance with another embodiment of the present invention, a method of controlling access to DMA controllers in a multi-tasking communications processor implemented on at least one ring network is provided. The method comprises issuing DMA requests to a target DMA controller, maintaining a count of DMA requests on a per-task basis, and issuing an acknowledgement that a DMA request has been satisfied by the target DMA controller. The method further comprises reducing the count based on the acknowledgement and enabling a processor responsible for issuing the DMA requests to perform new activity when the count has returned to zero. In one embodiment, the DMA requests are issued as messages on the at least one ring network. Similarly, the acknowledgement can be issued as a message on the at least one ring network.

[0937] Auto Set

[0938] In order to increase performance (e.g., to eliminate the need to set the default mask at the end of every task), the auto set functionality is defined. When the aset (auto set) bit is set, the mask bits will be set to their default value after the desired request has occurred without triggering a request to the network processor and a task switch. The auto set bit can be written by the network processor using the agent interface, or by using the DMA command (this is one of the options of the DMA command).

[0939] The default mask is: the peripheral request mask bit (mpreq) is set and all the other mask bits are cleared (see Table 28).

[0940] Task Priority Control

[0941] It is desirable to have the ability to control task priority level in order to influence task scheduling. The doorbell module supports this requirement in two ways. The first way is software control using the urg bit in the doorbell task register (not the task SPR). Each doorbell task has an urgent priority bit in its task register (urg). When this bit is set the task becomes urgent and all of its requests are considered as urgent requests. The urgent bit remains set as long as it is not cleared by the network processor.

[0942] A second way to control the request priority level is by sending messages to the doorbell with the urgent status indicating the request priority level. If the overwrite current status is also set then the request priority status bit in the doorbell is also updated. If the task urgent status bit is set the task requests are also considered urgent. This bit is mainly controlled by hardware.

[0943] It should also be noted that the task priority is reflected in the network processor status register.

[0944] Doorbell Operational Scenarios

[0945] Example A—Regular Serial Request:

[0946] (1) A serial sends a message with the destination address of its task requests register in the doorbell register file. The data part of the message specifies which bit to set.

[0947] (2) If the corresponding mask bit for this task is set (this is the default mask), then a valid request is sent to the doorbell priority logic.

[0948] (3) When this request becomes the highest priority pending request, it can trigger the network processor task switch.

[0949] (4) The doorbell samples the task number of highest priority pending request every time a yield is executed. If there are no pending tasks the doorbell waits until the first time there is a pending task (except if the next task is the current task, in which case the network processor waits until the yield indication, because there will be no task switch), and then samples the next task ID.

[0950] (5) After the next task ID is sampled by the network processor, the network processor performs the prefetch of the next task registers.

[0951] (6) The next task ID becomes current task ID.

[0952] (7) The doorbell logic clears the request bit and the mask register of the task which caused the task switch.

[0953] (8) The doorbell calculates a new next task ID.

[0954] Example B—DMA Request:

[0955] The handling of a DMA request is very similar to the handling of a serial request. The only difference is the process of setting the DMA request and the mask bits. At the time the DMA command is issued there is no information as to which request mask bit should be set. The doorbell logic will get this information from the DMA agent. This will be done using the DMA context table and a special option in the network processor DMA command (the NA bit in the DMA command). When the DMA request is registered with the DMA agent, the DMA agent will set the DMA mask bit in the doorbell register. The DMA agent will also tell the DMA controller to which request bit it should send the acknowledgement when the DMA transfer is finished, in order to decrement the request counter. When the counter reaches zero and if the appropriate mask bit is set, a valid task switch request will be issued to the doorbell logic.

[0956] Example C—DMA Request with Auto Set:

[0957] When the auto set bit is set, the doorbell logic will set the mask to the default mask value after the current task is finished without asserting a request for task switching.

[0958] Software/Hardware Restrictions

[0959] According to one embodiment of the invention, the following restriction is imposed: Only eight pending DMA requests (DMA requests that were issued by the DMA agent for which acknowledgement has not reached the doorbell) per task are handled by the doorbell.

[0960] Network Processor Debug Module

[0961] According to one embodiment of the invention, the network processor compound includes a debug module. The debug module supports various breakpoints and enables program code patching. The debug module can be programmed through the ring interface. The debug module contains two breakpoint channels and eight patch channels. Each one of the patch channels can be configured to be used as a patch channel or as an additional program address breakpoint channel. A single step program trace is supported.

[0962] A Breakpoint Event and a Patch Event

[0963] The network processor core supports two kinds of program breaks: a breakpoint and a patch. A breakpoint event causes the program flow to jump to a program location pointed to by a given vector and to enter the trap mode of execution by setting the trap mode bit located in the network processor task SPR. When in trap mode, no further breakpoint will be accepted. The trap mode bit will be cleared by executing an RFT (Return From Trap) instruction or by writing a zero to the trap mode bit. When the trap bit is cleared, the network processor returns to the normal execution mode where further breakpoints are accepted. A patch event causes the program flow to jump to a program location pointed to by a given vector. In a patch event the trap mode bit will not be set, so the processor remains in the normal execution mode. A patch event is useful for program patching of code written in ROM.

[0964] Patch Channels

[0965] According to one embodiment, there are eight patch channels. Each of the patch channels can be configured to operate as a patch channel or as an additional program address breakpoint channel. If a patch channel is enabled and is configured as a patch, a patch event will occur whenever there is a fetch from a program location equal to the catch address (discussed below). If a patch channel is enabled and is configured as a break, a breakpoint event will occur whenever there is a fetch from a program location equal to the catch address. Each one of the patch channels will cause the network processor program to jump to a different vector location according to a vector table (see the discussion on the vector table and Table 37 below).

[0966] Each of the patch channels includes a patch register as shown in Table 31.
TABLE 31

[0967] Patch Register. This is a 32 bit read/write register (through the ring). This register is cleared by a hardware reset:

[0968] Bits 15:0—Catch Address: This is the 16 bit program address which causes a patch event or a breakpoint event.

[0969] Bit 16—Break or Patch (B/P): When the B/P bit is cleared, the patch channel operates as a patch channel. When the B/P bit is set, the patch channel operates as an additional program address breakpoint channel.

[0970] Bit 17—EN: This is the channel enable bit. When EN is set, the channel is enabled. When EN is cleared, the channel is disabled.

[0971] Bits 31-18—reserved. These bits are reserved. Reserved bits are read as zero.

[0972] Address Breakpoint Channels

[0973] According to one approach, the debug unit includes two address breakpoint channels. Address breakpoint channels can be configured to cause a breakpoint when there is a program or data memory access to specific locations. Each of the address breakpoint channels is configured by its address register and by the address breakpoint control register.

[0974] Address Registers. Each of the two address breakpoint channels includes an Address Register. See Table 32 and Table 33, which show the channel 0 address register and the channel 1 address register, respectively. These are 32 bit read/write registers which are cleared by a hardware reset. Bits 15:0 hold the break address and bits 31:16 hold the break mask. The break address is the program location that causes a breakpoint event. A breakpoint event occurs only if the address breakpoint is enabled and there is a match between the memory address accessed and the break address. The break mask is used to specify which address bits to compare. For example, if all the mask bits are set then the address comparison will be done on all address bits. If, for example, mask bit 0 is cleared and all the rest are set then the comparison will not include bit 0 of the address. This way, an address breakpoint can be generated not only on a specific address but also on a window range of addresses. Table 34 shows the address breakpoint control register.
TABLE 32

[0975] TABLE 33

[0976] TABLE 34

[0977] Address Breakpoint Control Register. The address breakpoint control register is a 32 bit read/write register. This register is used to configure the operation of each one of the address breakpoint channels.

[0978] Bits 1:0—MODE0. These two bits specify for channel 0 on which event to cause an address breakpoint as specified in Table 35. Table 35 illustrates the Address Mode (AMODE) corresponding to bits 1:0.
TABLE 35
Mode: Breakpoint On
00: Program Fetch
01: Data Read
10: Data Write
11: Data Read or Write

[0979] Bit 2—Enable 0 (EN0): When EN0 is set, address breakpoint channel 0 is enabled and can cause a breakpoint event. When this bit is cleared, address breakpoint channel 0 is disabled.

[0980] Bits 4:3—MODE1: These two bits specify for channel 1 on which event to cause an address breakpoint as specified in Table 35. Bit 5—Enable 1 (EN1): When EN1 is set, address breakpoint channel 1 is enabled and can cause a breakpoint event. When this bit is cleared, address breakpoint channel 1 is disabled.
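The masked address comparison and the mode selection of an address breakpoint channel can be sketched as follows. The function name and the string labels used for the access types are hypothetical and serve only to illustrate the register fields described above.

```python
PROGRAM_FETCH = "fetch"
DATA_READ = "read"
DATA_WRITE = "write"

# Table 35: which access types trigger a breakpoint for each mode value.
MODE_EVENTS = {
    0b00: {PROGRAM_FETCH},
    0b01: {DATA_READ},
    0b10: {DATA_WRITE},
    0b11: {DATA_READ, DATA_WRITE},
}


def address_breakpoint_hit(address_reg: int, mode: int, enabled: bool,
                           access: str, address: int) -> bool:
    """Return True if the access matches the address breakpoint channel.

    address_reg -- 32 bit address register: bits 15:0 break address,
                   bits 31:16 break mask (a set mask bit means "compare this bit")
    mode        -- two bit MODE field of the channel
    enabled     -- EN bit of the channel
    """
    if not enabled or access not in MODE_EVENTS[mode & 0b11]:
        return False
    break_address = address_reg & 0xFFFF
    break_mask = (address_reg >> 16) & 0xFFFF
    # Only the address bits selected by the mask take part in the comparison.
    return ((address ^ break_address) & break_mask) == 0
```

With a break address of $1234 and a mask of $FFFE, for example, both $1234 and $1235 match, which gives a two-address window.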

[0981] Debug Control Register

[0982] The Debug Control Register is a 32 bit read/write register. This register is cleared by a hardware reset. Table 36 illustrates the debug control register according to one embodiment of the invention.
TABLE 36

[0983] Bits 10:0—Vector Base Address (VBA): This is the Vector Base Address. The VBA points to the starting location in memory of the vector table. The vector table is a 32 word table explained further below.

[0984] Bits 16:11—Task ID (TID): The TID is the task ID on which to cause or not to cause a breakpoint. It is used by the task breakpoint and can be used by the address breakpoints as explained by the following control bits.

[0985] Bit 19—TAND: When TAND is set, then an address breakpoint will occur only if there is both an address match and the current task ID is equal to the TID. Note: When a patch channel is configured to operate as a program address breakpoint channel, it has the same rules as the dedicated address channels and the TAND is treated the same.

[0986] Bit 20—TNOT: When TNOT is set, then an address breakpoint will occur only if there is an address match and the current task ID is different from the TID. Note: When a patch channel is configured to operate as a program address breakpoint channel, it has the same rules as the dedicated address channels and the TNOT is treated the same.
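The task qualification of address breakpoints can be illustrated with a short sketch that decodes the fields described so far and applies TAND and TNOT. The names are hypothetical, and the behavior when TAND and TNOT are both set is not specified in the text; this sketch simply applies both conditions, so in that case no breakpoint occurs.

```python
from typing import NamedTuple


class DebugControl(NamedTuple):
    vba: int    # bits 10:0, Vector Base Address
    tid: int    # bits 16:11, Task ID
    tand: bool  # bit 19
    tnot: bool  # bit 20


def decode_debug_control(reg: int) -> DebugControl:
    """Extract the VBA, TID, TAND and TNOT fields from the 32 bit register value."""
    return DebugControl(
        vba=reg & 0x7FF,
        tid=(reg >> 11) & 0x3F,
        tand=bool((reg >> 19) & 1),
        tnot=bool((reg >> 20) & 1),
    )


def address_breakpoint_allowed(ctrl: DebugControl, address_match: bool,
                               current_task: int) -> bool:
    """Apply the TAND/TNOT qualification to an address match."""
    if not address_match:
        return False
    if ctrl.tand and current_task != ctrl.tid:
        return False
    if ctrl.tnot and current_task == ctrl.tid:
        return False
    return True
```

With TAND set and TID equal to 5, an address match produces a breakpoint only while task 5 is executing; with TNOT set instead, the same match produces a breakpoint for every task except task 5.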

[0987] Bit 21—Enable Task Breakpoint (ENTB): This bit enables the task ID breakpoint. When ENTB is set, a task switch to a task ID which is equal to TID will cause a breakpoint event. When this bit is cleared, the task ID breakpoint is disabled. When setting the ENTB bit, the current task ID is compared to the TID and, if equal, there will be a breakpoint. Further task ID breakpoints will occur only upon switching to a new task which is equal to the TID.

[0988] Bit 22—Enable Yield Breakpoint (ENYB): This bit enables the yield breakpoint. When ENYB is set, any yield (task switch) will cause a breakpoint event. When this bit is cleared, the yield breakpoint is disabled.

[0989] Bit 37—TRACE: When the TRACE bit is set, a breakpoint will occur on every new instruction execution, thus allowing a single step instruction trace. When the TRACE bit is cleared, trace is disabled.

[0990] The Vector Table

[0991] In case of a breakpoint event or a patch event, the debug module supplies the network processor core with a vector for where to jump. The vector table is illustrated in Table 37. Each event has a different vector that is calculated by taking the 11 bit VBA and concatenating to it a 5 bit offset. For example, assume that the 11 bit VBA is all zeros. In this case, the breakpoint vector will point to program address $2, patch 0 will point to $4, and so on. The increments are of 2 instruction spaces for each of the events.

TABLE 37
Address                   For
VBA + $0                  Reserved for reset
VBA + $2                  Breakpoint
VBA + $4                  Patch 0
VBA + $6                  Patch 1
VBA + $8                  Patch 2
VBA + $A                  Patch 3
VBA + $C                  Patch 4
VBA + $E                  Patch 5
VBA + $10                 Patch 6
VBA + $12                 Patch 7
VBA + $14 to VBA + $1F    Reserved
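The vector calculation can be sketched as follows, assuming the 11 bit VBA forms the upper bits and the 5 bit offset the lower bits of the resulting program address. The dictionary and function names are illustrative only.

```python
# Offsets taken from Table 37; the event names are hypothetical labels.
VECTOR_OFFSETS = {"reset": 0x00, "breakpoint": 0x02}
VECTOR_OFFSETS.update({f"patch{n}": 0x04 + 2 * n for n in range(8)})


def vector_address(vba, event):
    """Concatenate the 11 bit VBA with the 5 bit offset for the event."""
    if not 0 <= vba < (1 << 11):
        raise ValueError("VBA must fit in 11 bits")
    return (vba << 5) | VECTOR_OFFSETS[event]
```

For a VBA of all zeros this gives $2 for the breakpoint vector and $4 for patch 0, matching the example in the text.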

[0992] Breakpoint Status Bits

[0993] According to one aspect of the invention, special status bits are located in the processor Task SPR to reflect the cause of the breakpoint event. These bits are the PAB, DAB, TB and YB bits. The PAB bit is for a program address breakpoint. The DAB bit is for a data address breakpoint. The TB bit is for a task breakpoint. The YB bit is for a yield breakpoint. These bits are set whenever the relevant breakpoint occurs. These bits are cleared by the RFT instruction.

[0994] Agent Interface

[0995] According to one aspect of the invention, the agent interface connects the processor to all of the agents in the compound. This interface is used by the network processor to read and write data.

[0996] Signal Description

[0997] Table 38 provides the agent interface signal list.

TABLE 38
Signal name                        Description                                                                     Direction (relative to Vobla)   Remarks
V_AGENT_RA[31:0]                   The content of register RA from the AGENT opcode.                               Output
V_AGENT_RAP[31:0]                  The content of register RAP. RAP is the RA+1 register.                          Output
V_AGENT_RB[31:0]                   The content of register RB from the AGENT opcode, or an 8 bit immediate value.  Output                          May be reduced to 16 bits.
V_AGENT_ID[4:0]                    Agent ID. The ID of the selected agent.                                         Output
V_AGENT_OPTIONS[9:0]               Various options used by the agents.                                             Output
[module_prefix]_read_DATA[63:0]    Data from the agents.                                                           Input
V_AGENT_WR                         Write to agent indication.                                                      Output
V_AGENT_RD                         Read from agent indication.                                                     Output
V_AGENT_DOUBLE                     Load double from agent.                                                         Output

[0998] Agent ID Allocation

[0999] Table 39 below provides the agent ID allocation.

TABLE 39
Agent Name        ID Number
DMA agent         00000-00011
CRC               00100
Multireader       01000
Doorbell          01001
Timer             01010
message sender    01011

[1000] Agent Register Mapping

[1001] Table 40 illustrates agent register mapping.

TABLE 40
Agent Name    Register Name    Register Address
CRC           crc_residue      0
              bip16_residue    1
Doorbell      TGMR_L           0
              TGMR_H           1
Timer         timer_control    0
              time_stamp       1

[1002] Network Processor Compound Memory Map

[1003] Table 41 illustrates the network processor compound memory map according to an embodiment of the invention.

TABLE 41
Name                                   Address
dma_agent_context_table0               Vobla_compound_register_base + $0
dma_agent_context_table1               Vobla_compound_register_base + $2
dma_agent_context_table2               Vobla_compound_register_base + $4
dma_agent_context_table3               Vobla_compound_register_base + $F
dma_token_register0                    Vobla_compound_register_base + $10
dma_token_register1                    Vobla_compound_register_base + $12
dma_token_register2                    Vobla_compound_register_base + $14
dma_token_register3                    Vobla_compound_register_base + $1F
dma_address_error_mask_register0       Vobla_compound_register_base + $20
dma_address_error_mask_register1       Vobla_compound_register_base + $22
dma_address_error_mask_register2       Vobla_compound_register_base + $24
dma_address_error_mask_register3       Vobla_compound_register_base + $2F
channel0_address_register              Vobla_compound_register_base + $30
channel1_address_register              Vobla_compound_register_base + $31
address_breakpoint_control_register    Vobla_compound_register_base + $38
debug_control_register                 Vobla_compound_register_base + $39
debug_petch_register0                  Vobla_compound_register_base + $40
debug_petch_register1                  Vobla_compound_register_base + $41
debug_petch_register2                  Vobla_compound_register_base + $42
debug_petch_register3                  Vobla_compound_register_base + $43
debug_petch_register4                  Vobla_compound_register_base + $44
debug_petch_register5                  Vobla_compound_register_base + $45
debug_petch_register6                  Vobla_compound_register_base + $46
debug_petch_register7                  Vobla_compound_register_base + $47
doorbell_request_register[63:0]        Vobla_compound_register_base + $80 to Vobla_compound_register_base + $BF

[1004] Communications Processor Implementing a Ring Network

[1005] The inventive aspects of the ring network and/or the network processor, as described above, find particular benefit when implemented in combination in a high-performance communications processor in accordance with the present invention. The high performance communications processor (HPCP) of the invention may on occasion be referred to as the Trajan. As will be evident from the following written description, the HPCP may be implemented in various fashions without departing from the true spirit and scope of the invention. Just by way of example, the number of DMA modules, the characteristics of the control processor, the number of interfaces supported to ATM, and the number of flexible packet processors may vary. Generally, the flexible packet processor of the present invention may on occasion be referred to herein as the Vobla.

[1006] Generally, the HPCP should be capable of supporting a variety of applications in a range of markets. For example, the HPCP may be used for Customer Premises Equipment (CPE) applications, such as for Digital Subscriber Line (DSL) services. DSL, sometimes generically referred to as xDSL, refers to the family of digital lines that carriers may provide, such as ADSL, HDSL, SDSL, and so forth. These technologies are all well understood in the art. DSL CPE applications for the HPCP may include bridges for Ethernet and USB; DSL-Ethernet routers; DSL-home wireless routers; Voice Integrated Access Devices (IADs); and service gateways. The HPCP may also be used for consumer networking equipment, such as home routers (Ethernet and/or wireless) and networked appliances (e.g., Universal Plug 'n Play [UPnP] devices). The HPCP may also be used for access network equipment applications, line card applications, and voice processing applications (e.g., voice gateways). Generally, the HPCP will find beneficial application in any voice or communication processing application.

[1007] In sum, the goal of the HPCP is to provide a PHY-neutral communications processor that can be readily integrated with appropriate PHY functionality (e.g., ADSL PHY, SHDSL PHY, xDSL PHY, etc.) to support a myriad of applications on a variety of network platforms based on a single system on a chip (SOC) building block.

[1008] According to just one embodiment, the HPCP (e.g., the so-called Trajan I) would have the baseline specifications set forth in Table 42 below. Table 42 is offered solely for purposes of example and the invention is in no way limited to this embodiment. In fact, it is anticipated that continuing advances in the processor art will result in continually changing parameters.

TABLE 42
Parameter                                    Value
Processors                                   2 × NP, 1 × MIPS
Clock Speed (MHz)                            200 (NP), 266 (MIPS)
Network Interfaces                           2 × Utopia (8/16 bit); 4 × Ethernet MII/RMII (10/100); 256 time slots TDM I/f
Expansion Interfaces                         External Peripheral Bus (EPB)
Hardware Accelerators                        Cell/Packet lookup; MMU; 3 × DMA IO
Router/Bridge Throughput, ATM-Eth (kpps)     400
Shaped/Unshaped ATM throughput               2 × OC-3

[1009] The communications architecture employed by the HPCP could be a conventional bus-based architecture, a switch fabric type architecture, star-based architecture, or other architecture known in the art. Preferably, the HPCP employs the rings architecture and message based protocol of the present invention, discussed above, whereby each module of the HPCP occupies a position on a ring, as discussed below.

[1010] In accordance with one embodiment of the present invention, a communications processing system utilizing a ring network architecture is provided. The communications processing system comprises a plurality of ring members connected in point-to-point fashion along the ring network, a transaction based connectivity for communicating at least one message among at least a portion of the ring members, wherein the message includes information indicative of a destination ring member for which the message is intended and the message is passed around the ring network until reaching the destination ring member, and wherein the destination ring member is adapted to receive the message and remove it from the ring network. The communication processing system, in one embodiment, is implemented on a single chip, while in other embodiments the system is implemented on more than one chip. The information indicative of a destination ring member can comprise a ring member identifier and/or an address corresponding to the destination ring member. In one embodiment, the ring network includes a bridge across the ring network that allows messages to travel from one side to another side without passing through intermediate ring members.

[1011] The transaction based connectivity of the system may provide for messages to be passed around the ring network according to a clocking scheme. In one implementation, the clocking scheme provides for the messages to travel one ring member per clock cycle. Similarly, the transaction based connectivity can provide for a plurality of messages to travel the ring network, each message traveling one ring member per clock cycle unless a message is consumed at a given ring member. Likewise, the connectivity may provide for messages comprising transactions to travel the ring network, wherein the messages comprise one or more of a command, an instruction, a type, an address, and data.

[1012] In one embodiment, the message arriving at a non-destination ring member will be passed to the next ring member on the ring network. Alternatively, the message arriving at a destination ring member will be consumed by the destination ring member. In this case, the message can be removed from the ring network while being consumed so that a slot on the ring network is made available. The available slot may enable a downstream ring member to insert a message in the slot.

[1013] Furthermore, in one embodiment, each ring member receiving a message is adapted to check a destination address portion of the message to determine if the message is intended for that ring member, and if the destination address portion corresponds to that ring member, the ring member takes the message off of the ring network and consumes the message.
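The pass-or-consume behavior can be illustrated with a small simulation. This is a sketch only: it assumes one slot per ring member, uses the member's index as its address, and uses a hypothetical dictionary layout for messages.

```python
def step(slots, inbox):
    """Advance the ring by one clock cycle.

    slots[i] holds the message currently leaving member i, or None.
    Each message moves to the next member; that member checks the
    destination and either consumes the message or passes it on.
    """
    n = len(slots)
    moved = [None] * n
    for i, msg in enumerate(slots):
        if msg is None:
            continue
        nxt = (i + 1) % n
        if msg["dest"] == nxt:
            inbox[nxt].append(msg)  # consumed: the slot becomes free
        else:
            moved[nxt] = msg        # not for this member: pass along
    return moved


def deliver(n_members, src, dest, payload):
    """Insert one message at src and count clocks until it is consumed."""
    slots = [None] * n_members
    inbox = [[] for _ in range(n_members)]
    slots[src] = {"dest": dest, "payload": payload}
    clocks = 0
    while any(m is not None for m in slots):
        slots = step(slots, inbox)
        clocks += 1
    return clocks, inbox
```

Because every message advances exactly one member per clock in this model, two messages never contend for the same slot, and a consumed message leaves its slot empty for a downstream member to use.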

[1014] In one embodiment, the at least one message comprises a message that causes ring members to assign address space during configuration of the ring network. This message may comprise an enumeration message. The assignment of address space during configuration allows a processing ring member to subsequently infer the configuration of the ring network.
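One way to picture address assignment during enumeration is sketched below. The mechanism shown, in which the enumeration message carries the next free address and each member claims a contiguous block as the message passes, is an assumption made for illustration; the names are hypothetical.

```python
def enumerate_ring(sizes, base=0):
    """Walk the ring once, giving each member a contiguous address block.

    sizes[i] is the amount of address space member i requests.
    Returns the (start, size) blocks and the next free address.
    """
    next_free = base
    assignments = []
    for size in sizes:
        assignments.append((next_free, size))
        next_free += size
    return assignments, next_free


def owner_of(assignments, address):
    """Infer which ring member owns an address from the assigned blocks."""
    for index, (start, size) in enumerate(assignments):
        if start <= address < start + size:
            return index
    return None
```

Under this assumption, the value the message carries when it returns to its originator equals the total address space claimed, which is one way a processing ring member could infer the ring configuration.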

[1015] In accordance with another embodiment of the present invention, a communications processing system utilizing a ring network architecture is provided. The communications processing system comprises a plurality of ring members having unique addresses and connected in a point-to-point fashion along the ring network, a transaction based connectivity for communicating at least one message among at least a portion of the ring members, wherein the message includes a destination ring member address for which the message is intended and the message is passed around the ring network until reaching the destination ring member, and wherein the destination ring member is adapted to receive the message and remove it from the ring network. The communication processing system, in one embodiment, is implemented on a single chip, while in other embodiments the system is implemented on more than one chip. In one embodiment, the ring network includes a bridge across the ring network that allows messages to travel from one side to another side without passing through intermediate ring members.

[1016] The transaction based connectivity of the system may provide for messages to be passed around the ring network according to a clocking scheme. In one implementation, the clocking scheme provides for the messages to travel one ring member per clock cycle. Similarly, the transaction based connectivity can provide for a plurality of messages to travel the ring network, each message traveling one ring member per clock cycle unless a message is consumed at a given ring member. Likewise, the connectivity may provide for messages comprising transactions to travel the ring network, wherein the messages comprise one or more of a command, an instruction, a type, an address, and data. The destination ring member address can comprise a starting address for the destination ring member and/or an address within the address space assigned for the destination ring member.

[1017] In one embodiment, a message arriving at a non-destination ring member will be passed to the next ring member on the ring network, while a message arriving at the destination ring member will be consumed by it. In one embodiment, each ring member receiving a message checks the destination ring member address of the message to determine if the message is intended for that ring member, and if the destination ring member address corresponds to that ring member, the ring member takes the message off of the ring network and consumes the message. If consumed, the message can be removed from the ring network while being consumed so that a slot on the ring network is made available. The available slot may enable a downstream ring member to insert a message in the slot.

[1018] In one embodiment, the at least one message comprises a message that causes ring members to assign address space during configuration of the ring network. This message may comprise an enumeration message. The assignment of address space during configuration allows a processing ring member to subsequently infer the configuration of the ring network.

[1019] In accordance with yet another embodiment of the present invention, a communications processing system utilizing a ring network is provided. The system comprises a plurality of ring members having unique addresses and communicatively connected in a point-to-point fashion along the ring network and a transaction based connectivity for communicating at least one message among at least a portion of the ring members, wherein the message travels from a first ring member to a second ring member based at least in part on an address assigned to the second ring member, the second ring member being the destination ring member for which the message is intended. The message is passed along the ring network from the first ring member to the second ring member by one or more other ring members each having an address intermediate the addresses of the first and second ring members, wherein the message is received and removed from the ring network upon receipt by the second ring member. The message can include information indicative of the address of the second ring member. The communication processing system, in one embodiment, is implemented on a single chip, while in other embodiments the system is implemented on more than one chip. In one embodiment, the ring network includes a bridge across the ring network that allows messages to travel from one side to another side without passing through intermediate ring members.

[1020] In one embodiment, the transaction based connectivity provides for messages to be passed around the ring network according to a clocking scheme. The clocking scheme, in one implementation, provides for the messages to travel one ring member per clock cycle. Similarly, the transaction based connectivity can provide for a plurality of messages to travel the ring network, each message traveling one ring member per clock cycle unless a message is consumed at a given ring member. A message arriving at a non-destination ring member can be passed to the next ring member on the ring network, while a message arriving at the destination ring member can be consumed by it. In one embodiment, each ring member receiving a message checks a destination address portion of the message to determine if the message is intended for that ring member, and if the destination address portion corresponds to that ring member, the ring member takes the message off of the ring network and consumes the message. If consumed, the message can be removed from the ring network while being consumed so that a slot on the ring network is made available, where the available slot enables a downstream ring member to insert a message in the slot. The connectivity also may provide for messages comprising transactions to travel the ring network, wherein the messages comprise one or more of a command, an instruction, a type, an address, and data.

[1021] In one embodiment, the at least one message comprises a message that causes ring members to assign address space during configuration of the ring network. This message may comprise an enumeration message. The assignment of address space during configuration allows a processing ring member to subsequently infer the configuration of the ring network.

[1022] In accordance with an additional embodiment of the present invention, a communications processor implemented on a chip is provided. The communications processor comprises a network processor including means for processing a plurality of protocols including ATM, frame relay, Ethernet, and IP, said means being programmable using a set of library commands to process additional protocols, and a protocol processor for controlling the network processor, wherein the protocol processor performs control plane processing and the network processor performs data plane processing. Further, the network processor and the protocol processor are ring members on at least one ring network, and the communications processor further comprises a plurality of other ring members on the at least one ring network. The network processor, in one embodiment, includes a plurality of compounds that share a single ring interface to the ring network. The communications processor can be PHY neutral.

[1023] The at least one ring network, in one embodiment, comprises multiple ring networks including a protocol processor ring network and a network processor ring network, where the network processor ring network can include a first network processor for transmitting packets and a second network processor for receiving packets.

[1024] In another embodiment, the network processor includes ultrafast task switching using active registers for current tasks and shadow registers for preloading next tasks. The communications processor may further comprise multiple DMA controllers for access to external memories.

[1025] The protocol processor, in one embodiment, is adapted to perform the following: signaling protocols; protocol management; exception handling; and system configuration and control. Similarly, the network processor can be adapted to perform the following: per-packet processing; packet forwarding; packet classification; quality-of-service handling; and packet reformatting.

[1026] The control path protocol support can be provided by the protocol processor and the data path protocol support can be provided by the network processor. Furthermore, the network processor can be adapted to perform zero overhead task switching.

[1027] In one embodiment, the network processor includes compound modules operating as parallel engines. The communications processor can be implemented to provide an enterprise integrated access device (EIAD), a multi-tenant unit (MTU) or remote terminal unit (RTU), a media gateway, and/or a voice gateway.

[1028] Exemplary Architectures of the HPCP

[1029] According to one embodiment, the HPCP is implemented using the rings architecture as illustrated in FIG. 61. This rings-type architecture is implemented on a semiconductor (e.g., on a chip) and is unlike token-ring arrangements in networks. According to FIG. 61, the HPCP SOC 620 employs four rings 622-628 that are connected by three inter-ring bridges 630-634. These bridges, also called sea bridges because they interconnect two disparate rings, have logic such that messages will traverse from the near side ring across the bridge if addressed to the far side ring. If messages are addressed to an address contained within the near side ring, the message is forwarded along the ring as in the usual case.
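The sea bridge decision described above can be sketched as a simple address-range check. The representation of the far side ring as a list of half-open address ranges, and the function name, are assumptions made for illustration.

```python
def bridge_route(dest_address, far_side_ranges):
    """Decide what a sea bridge does with a message on the near side ring.

    far_side_ranges is a list of (start, end) half-open address ranges
    reachable through the bridge (hypothetical representation).
    """
    for start, end in far_side_ranges:
        if start <= dest_address < end:
            return "cross"    # addressed to the far side ring
    return "forward"          # stays on the near side ring
```

A message whose destination falls outside every far side range simply continues along the near side ring, as in the usual case.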

[1030] As illustrated, the HPCP 620 generally divides the modules along the rings according to functionality. There is a receiver (Rx) ring 628 for receiving data transmitted from outside the HPCP chip. There is a transmitter (Tx) ring 626 for transmitting data to go outside the HPCP chip. There is a main ring or control ring (PP Ring) 622 which includes the PP (packet processor) 636, which can be considered the host or CPU (anchor) of the HPCP. There is a packet processor ring 624 which includes several packet processors (i.e., the VC0 638 and VC1 640 network processors) and DMAs 642, 644 for packet processing of the various protocols that are handled by the HPCP 620. In order to reduce latency in messaging, the packet processor ring 624 includes several intra-ring bridges 646, 648, also called land bridges because they provide a bridge-type connection within a single ring.

[1031] In certain of the figures that follow, the illustration of the HPCP is not graphically depicted as a rings-type arrangement. However, unless stated otherwise, the arrangements correspond to a rings-type arrangement and logical path.

[1032] Generally, the improvement in the HPCP over other communications processors can be tied to, individually and in combination, the use of (1) a flexible packet processor with ultrafast task-switching, and (2) the any-to-any mesh internal rings-type communications architecture. This ensures architecture scalability for higher speed ports or higher port density. Additionally, the HPCP provides (3) a design for low system cost. The usage of low cost memories (DDR-SDRAM) and the unique streamline memory architecture eliminates the need for high speed SRAM or external lookup engines (CAM). The primary beneficiaries of the HPCP are relatively high-end applications for the CPE and access markets.

[1033] Preferably, the HPCP supports a rate of about 1.2 Gbps (simplex serial rate) for L2/L3 wire speed IP/ATM/TDM protocol processing. As indicated above, the HPCP platform includes a core flexible packet processor (RISC [Reduced Instruction Set Computer] network processor technology) and an SOC rings-type interconnect technology. This approach provides a high performance programmable networking platform that permits rapid introduction of new features, new standards, and other enhancements. The robustness of the HPCP allows it to be shared among multiple product lines. According to one embodiment, the HPCP is designed as a 0.18 micron, 520 HS-PBGA (Heat Spread Plastic Ball Grid Array) chip.

[1034] FIG. 62 is a schematic diagram of an embodiment of the HPCP 620, sometimes referred to herein as the Trajan. According to this embodiment, the HPCP 620 employs a rings-type communication architecture, which is indicated on FIG. 62 as the Fabric on a Chip 670. The packet processor 672 (also referred to as control packet processor, MIPS, CPU, or simply, the host) functions as the control processor for the HPCP 620. The packet processor 672 can be implemented using any suitable processor. Preferably, the packet processor 672 has the following characteristics: 266 MHz (preferably, MIPS) processor; MIPS-I Instruction Set; 16K I, 16K D cache; supports Write back and Write forward or through; has cache coherency; supports Direct Map; and has an MMU with 64 TLBs (Translation Look-aside Buffers). Other suitable alternatives to a MIPS processor could be employed. The HPCP embodiment of FIG. 62 employs two network processors 674, 676 (Voblas) for packet processing. Preferably, the network processors 674, 676 are designed in accordance with the flexible packet processor discussed elsewhere herein. Each of the network processors 674, 676 preferably communicates with an operatively connected multi-access SRAM, which preferably has 72 Kbytes of memory.

[1035] The HPCP embodiment of FIG. 62 employs three DMA modules, DMA 678, DMA 680, and DMA 682. There also are two DDR-SDRAM controllers 684, 686, each of which is capable of interfacing to a DDR-SDRAM 688, 690 running at 133/166/200 MHz. Each controller supports a 32 bit data bus. Each controller 684, 686 supports two masters (DMA and PP) and arbitrates between them. An efficient packing algorithm is used to optimize memory transactions. Coherency is preserved between the two masters and between READ and WRITE operations. DMA 678 and DMA 680 can master the two memory controllers accordingly. Each can arbitrate for the memory bus and is capable of bursts up to 64 bytes on a transaction.

[1036] The EPB (External Peripheral Bus) interface (I/f) 692 is used to interface to a boot EPROM, Security Accelerators and a DSP (collectively figure element 694). The EPB bus runs at 80 MHz with asynchronous address/data protocol. The EPB 692 also has five (5) dedicated Chip Selects (CS) and a special 32 bit CS bus transaction.

[1037] The HPCP of FIG. 62 includes a number of peripheral modules, including TDM 696, 4× Ethernet (MII and RMII) 698, a first ATM Utopia Level 2 700, a second ATM Utopia Level 2 702, a 3× MFSU 704, and an I2C/SPI SW base 706.

[1038] The TDM module 696 may be used to support time division multiplexing connectivity, such as for T1/E1. Preferably, the TDM module 696 supports the following: up to 256 time slots; HDLC (high-level data link control) and a transparent mode. The TDM module can also interface high-speed TDM busses (backplane) such as H-MVIP, SCSA, H110, and ST-BUS. The 4× Ethernet MII/RMII module 698 preferably supports 10/100 Ethernet connectivity. The 3× MFSU module 704 preferably supports high speed (up to 52 Mbps) HDLC or high-speed UART (Universal Asynchronous Receiver-Transmitter).

[1039] The HPCP has two ATM interfaces 700, 702 using Utopia Level 2. Each port can be configured for an 8 bit or 16 bit data path. The ATM port can be configured as a master or as a slave. In a master configuration, one port (subscriber port) can master up to 124 PHYs and the second port (uplink or network port) can master up to 15 PHYs. Both ports can support an Extended Utopia Mode where the ATM cell length is programmable and can be extended from 53 bytes up to 64 bytes.

[1040] Ring Interface on an EPB

[1041] As discussed previously, in at least one embodiment, the HPCP is implemented using a ring architecture and message protocol as disclosed herein. As illustrated with reference to FIGS. 63-67, an external interface 720 may be implemented along with the EPB 692. An external FPGA 722 that sits on the EPB busses may play the role of external ring keeper; the Anchor can be external to the network processor and on the FPGA. Of course, instead of an FPGA it could be another HPCP. The input's job is to disable EPB operation for the current transaction and enable movement of ring data. This input is driven either by the FPGA or by the second HPCP. The output's job is to tell the second HPCP or the external FPGA, whichever is the ring keeper, that the output data is intended for it. Regular EPB customers (like Flash) will look at the output as an additional enable. One advantage of this arrangement is that a number of pre-existing pins are used for part-time ring transactions. The speed of ring messaging is reasonably high (the same speed as the original EPB). The changes to the existing EPB are minimal. The ring side implementation is an exact copy of a bridge plus a state machine.

[1042] In the implementation illustrated in FIG. 63, 32 bits of data in/out are used to carry messages. There also is the potential to use the address bits as well, thus increasing the throughput but complicating the design. Message_sync 724 is a relatively simple block that takes care of turning 60 (or 92) bits of outgoing message into several (2 to 3) transactions on the EPB-like interface. It also turns incoming data (from 2 to 3 transactions) into messages. On the inside, message_sync interfaces with a regular bridge 726. Since EPB DMA 728 can potentially sit on a busy ring, message_sync 724 and its bridge 726 can be placed on the less busy ring.
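The segmentation performed by message_sync can be sketched as splitting a message into 32 bit words and joining them again on receipt. The word ordering (least significant word first) and the function names are assumptions; only the 32 bit data width and the 60 or 92 bit message sizes come from the description.

```python
WORD_BITS = 32  # width of the EPB data path used for messages


def split_message(message, message_bits):
    """Split a message into 32 bit transactions, low word first (assumed)."""
    n_words = -(-message_bits // WORD_BITS)  # ceiling division
    mask = (1 << WORD_BITS) - 1
    return [(message >> (WORD_BITS * i)) & mask for i in range(n_words)]


def join_message(words):
    """Rebuild a message from the words produced by split_message."""
    value = 0
    for i, word in enumerate(words):
        value |= word << (WORD_BITS * i)
    return value
```

A 60 bit message needs two transactions and a 92 bit message needs three, which agrees with the 2 to 3 transactions mentioned above.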

[1043] The mux 730 takes data either from EPB 692 or from message_sync 724 depending on the transaction. Handshake signals basically ask the EPB 692 to give up a cycle. In the other direction, EPB 692 acknowledges the tristate or mux surrender. Using this fact, the chip selects can be disabled in ring-oriented transactions.

[1044] In order to program the message_sync 724 and EPB 692 to enable/disable external ring operation, the hardware can sample a pin during power up reset. First, hardware reset puts message_sync in disabled mode such that during Enumeration, it passes on the Enumeration without attempting to talk to the other chip. The message_sync 724 assigns to itself space of one address. After initial Enumeration, PP enables (or not) the message_sync to work. If message_sync 724 is enabled, a second Enumeration is done. This time message_sync transmits the Enumeration message to the other chip. Then it waits for the message to circle back to it.

[1045] The HPCP chip requires interfaces to various devices, which can serve as slaves, masters, or both. Some of these devices are: DSPs 732; encryption engines 734; external buses such as PCI; external memories; and other HPCP chips. Some of these devices may directly connect to the EPB port on the chip. However, in order to use these devices, a complex handshake is often required which would force the PP to assist in each transfer. In the case where these devices should initiate a data transfer into the HPCP, a special mechanism is required in order to avoid polling on the EPB port. The interface described is designed to allow a more robust and efficient connection of such devices to the chip, and is consistent with the HPCP hardware and software architecture. FIGS. 64-67 describe the interface, starting from a system view and ending with detailed block diagrams of the components.

[1046] Operation

[1047] The interface described above implements a ring interface to external logic, allowing the HPCP 740 to write out messages, and external devices to generate arbitrary ring messages in the HPCP 740. The FPGA 742, which forms the interface between the HPCP and the external devices 744, 746, serves as shared memory. This memory can be independently accessed from both sides. In addition, accesses from the HPCP 740 can send messages to the external devices, and accesses from the external devices can generate messages on the HPCP ring. The DPR (dual port RAM) 780 is seen either as random access memory (RAM) or as a FIFO, depending on the access address. Two FIFOs 748, 750 are implemented: one for receiving ring messages from the HPCP and one for sending messages to the HPCP.

[1048] Message Generation by the HPCP

[1049] When the ring interface recognizes a message to the external interface, a write burst is issued to the memory controller 760. This write has a fixed length of 128 bits. The write is always targeted to the same address, being the write FIFO address in the external device. The external device indicates to the HPCP 740 when data is being read from the FIFO. The HPCP 740 knows in advance the size of the write FIFO, and therefore knows when it is possible to issue more write commands to the memory controller. When it is no longer possible to issue writes, and all write buffers on the way are full, the OK signal to the ring interface is de-asserted.
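The flow control implied above resembles credit-based accounting: the HPCP tracks how many 128 bit writes are outstanding against a FIFO depth it knows in advance. The sketch below is a simplification under that assumption; it ignores the intermediate write buffers, and the class and method names are hypothetical.

```python
class WriteFifoCredits:
    """Track outstanding 128 bit writes against a known FIFO depth."""

    def __init__(self, fifo_depth):
        self.depth = fifo_depth   # write FIFO size, known in advance
        self.outstanding = 0      # writes issued but not yet read out

    def ok(self):
        """Model of the OK signal: asserted while another write fits."""
        return self.outstanding < self.depth

    def issue_write(self):
        if not self.ok():
            raise RuntimeError("write FIFO full; OK is de-asserted")
        self.outstanding += 1

    def external_read(self):
        """Called when the external device indicates it read an entry."""
        if self.outstanding == 0:
            raise RuntimeError("no entry to read")
        self.outstanding -= 1
```

Each read indication from the external device returns one credit, which is how the sketch decides when further writes may be issued.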

[1050] Message Generation by the External Device

[1051] The FIFO mapping is used to queue messages to be read by the HPCP. The FIFO memory is 128 bits wide (not all bits have to be implemented in hardware). Each ring message occupies four 32-bit data entries, to be read by the HPCP. When the message is complete (all 128 bits written) the SYNC output to the HPCP is activated, indicating that a message has been written to the message queue. This allows the HPCP to keep track of the number of messages written, and to read the appropriate number of messages.

[1052] The HPCP counts the number of messages entered into the queue in a request counter, and the number of read messages in a service counter. When there are pending messages (the request counter is greater than the service counter) and the appropriate read port is free, the HPCP issues a 128-bit read. The implementation of this read request depends on the port type. Ports that support burst transfers are issued one 128-bit burst read. Ports that support only 32-bit data transfers are issued four reads. When the read request is complete, the service counter is incremented, indicating that an external message has been served.

[1053] The data read from the port is used to generate a message. When all 128 data bits are received in the message sender, a message is sent to the ring interface.
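
The request/service counter scheme described above may be sketched as follows. This is a minimal illustration under stated assumptions: the names are hypothetical, the counters are unbounded integers (a hardware implementation would use fixed-width counters with wrap-around), and a read is represented only by the list of transfer widths it would issue.

```python
class MessageQueueTracker:
    """Hypothetical model of the request and service counters."""

    def __init__(self, burst_capable):
        self.request_count = 0   # incremented on each SYNC from the device
        self.service_count = 0   # incremented when a 128-bit read completes
        self.burst_capable = burst_capable

    def on_sync(self):
        # The external device finished writing one 128-bit message.
        self.request_count += 1

    def pending(self):
        return self.request_count - self.service_count

    def serve_one(self, port_free=True):
        """Return the read widths (in bits) issued, or [] if nothing is done."""
        if self.pending() <= 0 or not port_free:
            return []
        # Burst-capable ports get one 128-bit read; others get four 32-bit reads.
        reads = [128] if self.burst_capable else [32, 32, 32, 32]
        self.service_count += 1
        return reads
```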

[1054] Interrupts from HPCP to the External Device

[1055] The HPCP 740 can write data to special addresses that cause an interrupt to the external device. These addresses can be mapped either for interrupts only or for interrupts and data (in the DPR 780).

[1056] General-Purpose Data Transfer

[1057] Besides sending messages from the external device to the HPCP, this interface serves as a buffered link for general-purpose data transfer. The DPR can be read and written by both the HPCP and the external device. When the HPCP moves data for processing in the external device, it writes the data to the DPR, and then causes an interrupt in the external device by writing to an interrupt address.

[1058] The external device processes the data and returns it to the DPR. It then generates a message to the HPCP, instructing it to read back the data from the DPR. Since the entire DPR can be mapped as a FIFO, the external device can also write the entire data directly to the NP memory in the HPCP, and then notify the NP that the data is complete.

[1059] Supporting Multiple External Devices

[1060] One interface can support several external devices. Many DPR blocks can be implemented in a single FPGA, letting each of the external devices function independently. The message queue can either be unified into a single FIFO with write arbitration, or can be made of several FIFOs arbitrated during the message reads. In either case, read or write arbitration is performed in the FPGA and is transparent to the HPCP chip.

[1061] Traffic Management

[1062] In one embodiment, traffic that is already on the ring gets priority. Also, modules may be designed to consume incoming messages without delay, or with well-bounded delay. Further, a virtual watchdog timer can be implemented in the PP or one of the network processors. In this case, the watchdog timer periodically sends a message to itself via the ring. If this message has not arrived by the time the task is reawakened, this indicates that the ring is locked and in need of a reset.

[1063] Memory Considerations

[1064] Network processor RAM can grow up to, for example, 64 KB. The problem, however, is that this RAM uses 16 bits of ring addressing space. So with 20 bits of address there can be approximately 8 network processors in a reasonable system. The maximum theoretical number of network processors is 16, but space may be needed for other modules as well. There is no great penalty in extending the ring address space to more than 20 bits, and this can be done to accommodate design necessities; for this example, just 20 bits are used.

[1065] Inside the network processor compound there is more than RAM. There are doorbells, debug, timer and some more. They all need address space, but much less of it. If they are assigned their own address space, the resulting address space used by the network processor compound will be 128 KB. This is because 65 KB is actually used, but since the address space is rounded up to the next power of 2, 65 KB becomes 128 KB.
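
The arithmetic in the two preceding paragraphs can be checked with a short calculation. The sketch below assumes byte addressing on the ring and uses the example figures given above (20 address bits, 64 KB of RAM, 65 KB of total compound space); the helper name is hypothetical.

```python
def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n."""
    p = 1
    while p < n:
        p *= 2
    return p

RING_ADDRESS_BITS = 20          # example ring address width
NP_RAM_BYTES = 64 * 1024        # 64 KB of RAM uses 16 address bits

# Theoretical maximum number of network processors: 2**20 / 2**16.
max_network_processors = (1 << RING_ADDRESS_BITS) // NP_RAM_BYTES

# A compound that actually uses 65 KB is rounded up to the next power of 2.
compound_bytes = next_power_of_two(65 * 1024)

print(max_network_processors)   # 16
print(compound_bytes // 1024)   # 128
```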

[1066] Another aspect of the present invention is to steal a little bit of space from RAM on the rings and assign the low 1 KB of ring address space to all the little modules. For example, doorbells take 64 entries of address space (32-bit entries). When a work write message arrives for (vobla_base_address+32), it is routed to the doorbells and not to RAM.

[1067] This effectively protects the lower portion of the RAM from the ring network. The network processor can still load/store and even fetch from there, because load/store does not access the Rings.
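
The address decoding implied above may be sketched as follows. The exact placement of each small module inside the low 1 KB window is not stated in the text; this sketch assumes, purely for illustration, that the doorbells occupy the first 64 32-bit entries (offsets 0 to 255), which is consistent with the (vobla_base_address+32) example. All names are hypothetical.

```python
NP_WINDOW_BYTES = 64 * 1024      # ring address space of one network processor
SMALL_MODULE_WINDOW = 1024       # low 1 KB reserved for the small modules
DOORBELL_BYTES = 64 * 4          # 64 entries of 32 bits each (assumed at offset 0)

def route_ring_write(offset):
    """Decide which block receives a ring write at base address + offset."""
    if not 0 <= offset < NP_WINDOW_BYTES:
        raise ValueError("offset outside the network processor window")
    if offset < DOORBELL_BYTES:
        return "doorbells"
    if offset < SMALL_MODULE_WINDOW:
        return "other small modules"
    return "ram"
```

Only ring accesses are decoded this way; local load/store and fetch by the network processor bypass the ring and still reach the low 1 KB of RAM.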

[1068] FIGS. 68 and 69 illustrate two typical scenarios for Tx and Rx Ethernet as may be implemented in accordance with the present invention. To summarize the Ethernet compounds:

[1069] The Rx manager 802 is adapted to: send a regular request (doorbell, taskid and viscode); send the header ahead (it knows how many bytes, the status, and where); service multi read requests, for moving data to network processor RAM; and know when to switch to an urgent request.

[1070] The Tx manager 812 is adapted to: know when to start transmitting and when to retransmit; know when and how to issue a regular request (doorbell, taskid, viscode); perform free buffer count send ahead; perform an urgent request (last buffer and not last in it); resend the doorbell on request if there are free entries in the fifo (this is used by the task that adds frames to the transmit queue); and keep a RAM status fifo of finished frames (it sends the tx completion status word and the place to put it).

[1071] Rx Operation:

[1072] (1) Rx Frame starts incoming.

[1073] (2) It fills one entry (64 bytes) in fifo.

[1074] (3) Header+Status is pushed ahead to network processor RAM.

[1075] (4) Ring doorbell.

[1076] (5) Network processor switches to service the task.

[1077] (6) Network processor examines the header.

[1078] (7) Network processor sets up CRC snooper, especially the count.

[1079] (8) Network processor sends a multi read request from the rx fifo. It takes 12+4 clocks, so the network processor doesn't switch out; it just polls the CRC snooper at the end. If, after reading the whole fifo entry, there are still valid entries, a new doorbell is rung and a new header is sent ahead.

[1080] (9) Network processor issues DMA write request and yields out.

[1081] (10) DMA agent in network processor builds the messages to DMA based on the DMA opcode, src registers data and DMA context registers. This context has the knowledge of DMA address, token availability, little/big endian, etc. Part of communication with DMA is also a new token request.

[1082] (11) When the DMA is done, it sends a doorbell to reawaken the task, to continue the work.

[1083] Tx Operation:

[1084] (1) The task that adds frames to the transmit queue adds a frame and also sends a message to the transmitter fifo; if the transmitter is not doing anything, the doorbell of the transmit task is rung.

[1085] (2) transmit task is woken up by doorbell

[1086] (3) DMA read issued and network processor switches out

[1087] (4) when DMA is finished, multi read is issued from network processor RAM to enet tx

[1088] (5) when fifo entry is full, tx starts transmitting

[1089] (6) tx fifo updates the number of empty fifo entries in network processor RAM.

[1090] (7) If the task detects empty buffers it can fill, it fills them and retires. (8) When a fifo entry is empty, the free count is sent ahead and the doorbell is rung.

[1091] (9) If the last buffer is half full and it is not last, the Enet fifo requests urgent. (10) Each time a frame is finished by Tx, the manager sends a status word to a circular fifo in RAM. The manager uses a single address plus a 2-3 bit counter to create the address; it also writes the counter value in a fixed location.
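
The circular status fifo addressing in step (10) can be sketched as follows. This is a minimal illustration with several assumptions not stated in the text: status words are 4 bytes, RAM is modeled as a dictionary keyed by address, and the counter value written to the fixed location is the value after the increment. All names are hypothetical.

```python
class TxStatusRing:
    """Hypothetical model of the Tx manager's circular status fifo."""

    def __init__(self, base_address, counter_bits=3, word_bytes=4):
        self.base_address = base_address       # the single base address
        self.mask = (1 << counter_bits) - 1    # 2-3 bit counter wraps here
        self.word_bytes = word_bytes           # assumed status word size
        self.counter = 0

    def post_status(self, ram, status_word, counter_address):
        # Address = base address plus the small counter, scaled to words.
        address = self.base_address + self.counter * self.word_bytes
        ram[address] = status_word
        self.counter = (self.counter + 1) & self.mask
        # The counter value is also written to a fixed location.
        ram[counter_address] = self.counter
        return address
```

With a 2-bit counter the ring holds four status words, so the fifth completed frame overwrites the first slot.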

[1092] Programming Model for the HPCP

[1093] FIG. 70 illustrates the programming model 830 that may be employed for the HPCP. According to FIG. 70, the packet processor 832 (PP or control packet processor [CPP]) operates as the controller for the HPCP, performing such control plane functions as signaling protocols, protocol management, handling exceptions (faults), and system control and configuration. The network processors 834, 836 perform data plane functions such as per-packet handling, forwarding decisions, packet classification, quality-of-service (QoS) handling, queuing, scheduling and packet re-formatting.

[1094] Data Path Protocol Support for the HPCP

[1095] FIG. 71 illustrates the data path and control path protocol support 840 provided according to a preferred embodiment of the HPCP. This data path protocol support is provided by the network processor (e.g., flexible packet processor) engine of the HPCP. Each protocol capability shown in FIG. 71 is labeled according to its position in the Open Systems Interconnection (OSI) layered protocol model. The legend for FIG. 71 is as follows: (1)=Layer 1; (2)=Layer 2; (2*)=Layer 2 inter-working; (2.5*)=Layer 2.5 inter-working; and (3*)=Layer 3 inter-working.

[1096] The boxes in FIG. 71 labeled as (SM) illustrate the signaling and management provided in order to manage the data path protocol support according to a preferred embodiment of the HPCP. Preferably, the signaling and management operations shown in FIG. 71 correspond to the control plane operations performed by a CPP such as that shown in FIG. 63.

[1097] A Packet Processor for the HPCP

[1098] A flexible packet processor that could be employed in the HPCP typically includes capabilities, such as zero-overhead switching, not normally present in general purpose processors. Accordingly, the preferred packet processor provides the following characteristics:

[1099] nearly zero overhead task switch;

[1100] a hardware scheduler (next_task_id) with a strict priority scheme;

[1101] support for an unlimited number of threads/tasks (e.g., 32 simultaneous tasks);

[1102] allows connection to multiple external memories in parallel;

[1103] modular interface to accelerators;

[1104] compiler friendly;

[1105] tailored instruction set, with about 60 instructions for: ALU (Arithmetic Logic Unit), data manipulation, flow control, load/store, task management (yield), agent (Accelerators), SPR (Special Purpose Register) move, and the like.
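
The strict priority selection performed by the hardware scheduler listed above can be illustrated in software. The sketch below assumes, as a labeled assumption not stated in the text, that ready tasks are represented as bits in a pending mask and that a lower task id means higher priority; the function name simply echoes the next_task_id signal.

```python
def next_task_id(pending_mask, num_tasks=32):
    """Return the highest-priority ready task id, or None if none is ready.

    Assumption: bit i of pending_mask is set when task i is ready to run,
    and task 0 has the highest priority.
    """
    for task_id in range(num_tasks):
        if pending_mask & (1 << task_id):
            return task_id
    return None
```

For example, with tasks 2 and 4 ready (mask 0b10100), task 2 is selected; task 4 runs only once task 2 is no longer pending.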

[1106] FIG. 72 is a block diagram of the packet processor 636 employed in the HPCP 620 (FIG. 61) according to one embodiment of the invention. The packet processor 636 of FIG. 72 includes a packet processor core 850 (Vobla core), an internal memory 852 for programming and data, and a series of support submodules (compounds) for the packet processor, such as a core debug 854, a doorbell 856, a CRC 858, timers 860, DMA agent 862, and other agents 864. There is also an external interface 866 for interfacing to the fabric. The packet processor core 850 includes a program sequencer 870 that further includes a sequencer 872, a decoder 874 and a task switch block 876. There is also a load/store unit 880, a preload/bump unit 882, a register file unit 884, an arithmetic logic unit 886, and an agent interface module 888. A multiplexer 890 is disposed between the internal memory 852 and the load/store unit 880 and preload/bump unit 882. The packet processor 636 of FIG. 72 includes two source buses and a destination bus in the core, and an agent bus for interfacing with the agents.

[1107] FIG. 73 illustrates an exemplary processing pipeline 900 for a packet processor used in the HPCP according to an embodiment of the invention. The pipeline 900 of FIG. 73 shows the steps carried out for the execution of each packet processor instruction. According to FIG. 73, first an instruction is fetched. Then the instruction is decoded. The address for data to be accessed is then calculated. The source registers are read and the instruction is executed. The result is then written into the destination register.

[1108] Quality of Support Features for the HPCP

[1109] The HPCP may incorporate a number of quality of support (QoS) features according to one embodiment of the invention. For example, the HPCP may incorporate one or more of the following QoS operations: output queuing and scheduling; cell/frame pacing; IP classification (behavior aggregator); lookup engines; and congestion management. Preferably, these QoS operations are carried out by the packet processor implemented in the HPCP. The HPCP may provide frame-based output scheduling using an output scheduler. The output scheduler may provide a frame-based service that includes: up to 8 configurable queues 910-924 per virtual/physical transmit queue; up to M ports of Strict Priority (SP) 930; up to N ports of WFQ (Weighted Fair Queue) 932; and up to L ports of Low Priority (LP) 934.

[1110] FIG. 74 illustrates the output scheduling for the HPCP according to an embodiment of the invention.

[1111] Work-conserving schedulers are used. The scheduling order is: empty 1-M, empty M-N according to the scheduler, and then empty N-L. The HPCP may provide cell/frame pacing according to an embodiment of the invention. For example, an ATM pacer could employ a calendar wheel algorithm and provide a cell-based service with traffic management for UBR, UBR+, CBR, VBR, and VBRrt.
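
The scheduling order above (strict priority queues first, then the weighted queues, then the low priority queues) may be sketched as follows. The text does not specify the WFQ algorithm, so this sketch substitutes a deliberately simple stand-in: among non-empty weighted queues it serves the one with the smallest bytes-served-to-weight ratio. Frames are modeled as bytes objects, and all names are hypothetical.

```python
from collections import deque

class OutputScheduler:
    """Hypothetical work-conserving scheduler: SP, then weighted, then LP."""

    def __init__(self, num_sp, wfq_weights, num_lp):
        self.sp = [deque() for _ in range(num_sp)]
        self.wfq = [deque() for _ in wfq_weights]
        self.weights = list(wfq_weights)
        self.served = [0] * len(self.weights)   # bytes served per weighted queue
        self.lp = [deque() for _ in range(num_lp)]

    def dequeue(self):
        # 1. Strict priority queues are emptied first, in index order.
        for queue in self.sp:
            if queue:
                return queue.popleft()
        # 2. Weighted queues: simplified stand-in for WFQ (an assumption).
        candidates = [i for i, queue in enumerate(self.wfq) if queue]
        if candidates:
            i = min(candidates, key=lambda k: self.served[k] / self.weights[k])
            frame = self.wfq[i].popleft()
            self.served[i] += len(frame)
            return frame
        # 3. Low priority queues are served only when everything else is empty.
        for queue in self.lp:
            if queue:
                return queue.popleft()
        return None   # nothing to send
```

The scheduler is work conserving in the sense that dequeue() returns a frame whenever any queue is non-empty.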

[1112] A frame-based pacer (bandwidth limiter) may provide pacing per port in order to limit the port overall output to a predefined rate (e.g., allow a 100 Mbps uplink to be limited to a 12 Mbps service if required).
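
One common way to realize such a bandwidth limiter is a token bucket; the text does not say which mechanism the HPCP uses, so the following is only an illustrative sketch of that general technique. The class name, the burst parameter and the explicit timestamps are assumptions introduced for the example.

```python
class FramePacer:
    """Hypothetical token-bucket limiter for one port."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate_bytes_per_s = rate_bps / 8
        self.burst_bytes = burst_bytes     # maximum accumulated credit
        self.tokens = float(burst_bytes)   # start with a full bucket
        self.last_time = 0.0

    def allow(self, frame_bytes, now):
        """Return True if a frame of this size may be sent at time 'now'."""
        elapsed = now - self.last_time
        self.last_time = now
        # Credit accumulates at the configured rate, capped at the burst size.
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        if frame_bytes <= self.tokens:
            self.tokens -= frame_bytes
            return True
        return False
```

For the example in the text, FramePacer(12_000_000, 1500) would hold a port on a 100 Mbps uplink to roughly 12 Mbps, one full-size Ethernet frame at a time.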

[1113] Combining QoS for scheduling and pacing may be implemented in the HPCP as shown in FIG. 75. According to FIG. 75, the ports are fed to the configurable queues, which are then output as a UBR (unspecified bit rate) 940, VBR (variable bit rate) 942 or CBR (constant bit rate) data stream 946 to the calendar wheel algorithm 948. The output of the calendar wheel algorithm 948 is fed to the Utopia interface 950.

[1114] The HPCP may provide IP packet classification according to an embodiment of the invention. Preferably, the HPCP provides IPv4 packet classification.

[1115] The HPCP may provide this feature based on up to 512 classification rules that are prioritized by order. The packet classification is based on 5 or as many as 7 (see italicized fields) matching fields: IP Source Address; IP Destination Address; Protocol ID; TCP/UDP Source Port Number; TCP/UDP Destination Port Number; Type Of Service (TOS) bits; and Physical/Logical I/f Port Number. The matching criteria may be based on an exact match, a prefix match, and/or a range match on each field. Classification rules can be set dynamically by protocols such as MPLS or RSVP, or manually.
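
The rule matching described above may be illustrated with a small software classifier. This is a sketch only: the rule representation, field names and action labels are hypothetical, rules are scanned linearly in priority order (first match wins), and fields a rule does not mention are treated as wildcards.

```python
import ipaddress

def field_matches(spec, value):
    """spec is (kind, argument) with kind in 'exact', 'prefix' or 'range'."""
    kind, arg = spec
    if kind == "exact":
        return value == arg
    if kind == "prefix":
        return ipaddress.ip_address(value) in ipaddress.ip_network(arg)
    if kind == "range":
        low, high = arg
        return low <= value <= high
    raise ValueError("unknown match kind: " + kind)

def classify(rules, packet):
    """Return the action of the first rule whose fields all match."""
    for action, specs in rules:   # rules are prioritized by order
        if all(field_matches(spec, packet[name]) for name, spec in specs.items()):
            return action
    return None

# Hypothetical example rules and packet.
rules = [
    ("voice", {"protocol": ("exact", 17), "dst_port": ("range", (16384, 32767))}),
    ("corporate", {"src_ip": ("prefix", "10.0.0.0/8")}),
]
packet = {"src_ip": "10.1.2.3", "dst_ip": "192.0.2.1", "protocol": 17,
          "src_port": 5000, "dst_port": 20000, "tos": 0, "port": 1}
print(classify(rules, packet))   # voice
```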

[1116] The HPCP may also provide address lookup engines according to an embodiment of the invention. At Layer 2, the following address lookup capability is provided:

[1117] Ethernet MAC (Media Access Control) Address Uni-Cast/Multicast.

[1118] ATM VPI (Virtual Path ID)/VCI (Virtual Channel ID). Algorithmic approach supports single PHY and multi PHY.

[1119] MPLS Label Lookup.

[1120] At Layer 3, the following address lookup capability is provided:

[1121] IPv4 LPM (Longest Prefix Match) lookup.
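
Longest prefix match selects, among all table entries that contain the destination address, the one with the longest prefix. The sketch below shows only the matching rule using a linear scan over a hypothetical table; it says nothing about how the HPCP lookup engine is organized internally.

```python
import ipaddress

def longest_prefix_match(table, address):
    """Return the next hop of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(address)
    best = None   # (prefix length, next hop) of the best match so far
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, next_hop)
    return best[1] if best else None

# Hypothetical forwarding table.
table = {"0.0.0.0/0": "uplink", "10.0.0.0/8": "lan", "10.1.0.0/16": "branch"}
print(longest_prefix_match(table, "10.1.2.3"))   # branch
```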

[1122] The HPCP may also provide congestion management QoS according to an embodiment of the invention. The congestion management QoS includes random early detection (RED) per queue for frame based transmit queues and ATM congestion recovery EPD and PPD (Early Packet Discard and Partial Packet Discard, respectively).
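
In the commonly described form of RED, the drop probability rises linearly with the average queue length between a minimum and a maximum threshold. The text does not give the HPCP's parameters or exact variant, so the function below is only a sketch of that general formulation; the parameter names are assumptions.

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p):
    """Drop probability for a given average queue length (classic RED shape)."""
    if avg_queue < min_th:
        return 0.0                 # below the minimum threshold: never drop
    if avg_queue >= max_th:
        return 1.0                 # at or above the maximum: always drop
    # Linear ramp from 0 to max_p between the two thresholds.
    return max_p * (avg_queue - min_th) / (max_th - min_th)
```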

[1123] Exemplary Embodiments Showing Beneficial Applications for the HPCP

[1124] The HPCP (Trajan) is a versatile communications processor that can be used in many application scenarios. The HPCP's frame, cell and circuit processing capabilities make it well-suited for access applications. Set forth below are some exemplary application scenarios where the HPCP can be used as an SBC (Single Board Computer) or on a line card application in a chassis configuration.

[1125] FIG. 76 illustrates an exemplary application of the HPCP in order to provide an Enterprise Integrated Access Device (E-IAD) 960.

[1126] Enterprise IADs are used at the edge of a corporate network. This class of box or device is usually used at the edge of a corporate remote office. The enterprise IAD manages the traffic from the internal LAN (Local Area Network) to the external WAN (Wide Area Network). The WAN connectivity can be a dedicated leased line (Intranet), connectivity to an ISP (Internet Service Provider), or both. An IAD will typically also handle voice traffic, which may be from a direct connection to a PBX, or derived voice (over either ATM or IP networks).

[1127] The major tasks that an IAD needs to perform include routing, bridging, QoS prioritization (for voice packets), and inter-working functionality (RFC 1483, T1 emulation using CES or FRF). The various uplinks (WAN access methods) may be ATM, Frame Relay, and Ethernet. The media used by the uplink is typically either nxT1 for both ATM and Frame Relay or fiber for Ethernet and ATM.

[1128] FIG. 77 illustrates an exemplary application of the HPCP in order to provide a Multi Tenant Unit (MTU)/Remote Terminal Unit (RTU) 970. An MTU is very similar to the IAD in design. Both applications reside in the customer premises.

[1129] The MTU device is physically located in a basement of a building, providing distribution of high speed Internet access to a building. Typical applications will distribute xDSL connections to the offices/flats of a building using the existing copper infrastructure. The networking architecture will be stackable boxes using Ethernet or ATM as the backbone network. The MTU will be connected to an external edge router or the router functionality can be integrated into the system.

[1130] RTUs have similar functionality to an MTU (e.g., distribution of xDSL connectivity to a remote neighborhood). Unlike an MTU, however, an RTU is physically located outside a premise: it is managed and maintained by the ILEC (Incumbent Local Exchange Carrier) or CLEC (Competitive Local Exchange Carrier). RTU functionality may be considered as a DSLAM, meaning the aggregation of subscribers' traffic into a high-speed uplink. In terms of functionality, the RTU may be considered as an ATM switch.

[1131] The exemplary embodiment of FIG. 77 shows the MTU configuration where the HPCP can provide up to 62× DSL subscriber ports and 10/100 Ethernet to the backbone network. In this scenario, the HPCP will perform the IP routing functionality or Ethernet bridging via RFC 1483.

[1132] In the RTU case, the HPCP will perform ATM switching functionality, whereby user traffic will be policed according to the subscriber's contracts on the subscriber side, and shaped towards the network side on the aggregate (VP shaping). In this case, there is support for a total of 61 subscribers. In the RTU case, the POTS (Plain Old Telephone System) lines that are terminated at the RTU can be either backhauled on a separate TDM link, in which case there is no processing involving the HPCP, or can be packetized over ATM (CES or AAL2 trunking) using one pipe to backhaul both data and voice services.

[1133] Other exemplary uses of the HPCP include its application for a media gateway (MG) and voice gateway (VG). Many Telecom operators are updating their networks to support packetized voice services. One of the main driving forces is the savings in infrastructure support that result from an operator being able to maintain one network supporting both voice and data services.

[1134] A media gateway is a network element that links dissimilar networks, such as TDM to ATM or TDM to IP networks. Conceptually, the media gateway consists of four elements: a TDM I/f, a transcoding engine, a gateway controller, and a packet network interface. On the circuit-switched network side, a line card is used to connect the TDM channels from the PSTN to the gateway. A transcoding engine performs processing to convert between standards. A gateway controller manages the gateway and call routing. Finally, a packet network interface routes calls between the gateway and the packet infrastructure.

[1135] FIG. 78 illustrates one exemplary application of the HPCP (Trajan) in a media gateway application 980. In the proposed scheme, the HPCP will perform the networking protocols: both the data path (termination and packetization of AAL2 or RTP) and control using the PP (signaling protocols such as MGCP, V5.2, GR-303). External DSPs will perform the transcoding functions.

[1136] As shown in FIG. 78, an array of DSPs can be connected to the HPCP EPB (External Peripheral Bus). According to a proposed approach, FPGA mediator logic is used in order to boost the total system performance and to offload the PP processing bottleneck. Since many DSP vendors have a HOST PORT I/f as the mechanism to transfer data into/out of DSP memory, each transfer requires some control transactions (write to host port control register). This operation is costly and requires the involvement of the PP in each transfer. When the number of transactions is high, the PP will become a bottleneck. The solution is to create a protocol between the FPGA and the HPCP that can run in a burst mode and have the FPGA handle and manage the control side. The HPCP provides packet network interfaces both for ATM (Utopia) and for IP (Ethernet or POS). Signaling information for the TDM network can be transferred to the TDM cross connect using the HPCP's TDM ports.

[1137] In a trunking gateway application, the HPCP can be connected both to the TDM network and to the packet network and can perform the entire application.

[1138] FIG. 79 illustrates another exemplary application of the HPCP for a wireless access network (AN) 990. Wireless access networks (AN) consist of Base Transceiver Stations (BTS in 2G and NODE-B in 3G) and Base Station Controllers (BSC in 2G and RNC in 3G) that aggregate BTSs. The BTS interfaces between the radio network (RN) and the wireline access network. The BSCs manage radio resources and network functions between multiple BTSs and exchange traffic with the media gateway and the packet switching node in the wireline core transport network.

[1139] Generally, a BTS is connected to the WAN using T1/E1 lines. The transport layer on the WAN is either ATM or IP. For utilization and QoS reasons in an ATM transport choice, AAL2 is chosen as the transport layer. In this case, the BTS needs the following functionalities: ATM UNI functionality; wire-speed support for AAL2-Mux (I.366.1, I.366.2); and Inverse Multiplexer for ATM (IMA).

[1140] When the transport layer is IP-based, the BTS architecture will require the following functionalities: IP termination point; IP QoS support (IP classification, Diffserv and enhanced queuing/scheduling algorithms); RTP/UDP/IP header compression; and wire-speed support for PPP-Mux and/or ML-PPP.

[1141] In both architectures, the HPCP can be used as the central system processor based on its ability to process wire speed ATM and IP with 8 T1/E1 Interfaces to the WAN and Utopia or 10/100 interface to the backplane. The HPCP can also be used in the BSC as the aggregation processor. In this case, the processor needs to perform IP routing and ATM switching (AAL2 switching) at OC-3 rates (wire-speed).

[1142] FIG. 80 illustrates an exemplary application of the HPCP for a multi-service access platform 1000. A multi-service access platform combines numerous functions, services, access technologies and protocols in one network element. This flexibly configurable network element simplifies network design, planning, roll-out, and network management. Typical functions include the following:

[1143] Optical carrier (OC)-3c/12c/48c optical multiplexer

[1144] T3/OC3c aggregator

[1145] GR303 gateway

[1146] ATM switch

[1147] IP router

[1148] Access technologies include the following:

[1149] T1

[1150] T1-inverse multiplexing over ATM (IMA)

[1151] T3

[1152] xDSL (ADSL, VDSL)

[1153] Single-line high bit rate DSL (SHDSL)

[1154] Ethernet

[1155] Time division multiplexing (TDM), frame relay, ATM, and IP are supported as protocols. The multi-service access platform provides optimized network architecture and transport efficiency from the customer premises into the metropolitan area network (MAN).

[1156] The architecture of a multi-service access platform is shelf-based with an ATM and TDM backplane. Numerous subscriber (downlink) line cards connect customer premise equipment such as IADs, routers, and PABX and network elements such as DSLAMs to the platform. The uplink connectivity is usually to an SDH/SONET network via an optical link. A special voice gateway subsystem can be added for termination of VoPacket.

[1157] The HPCP is positioned to fit in or be compatible with many line cards and trunk cards in a multi-service access platform application. For example, the HPCP can handle up to 8 T1/E1 Frame-Relay to ATM interworking functions (FRF.5, FRF.8) on a line card; it can perform ATM switching both on a LC or at the trunk card at 2× OC-3 rate. It can also be used to terminate 4 10/100 Ethernet links and perform 1483 Ethernet bridging, IP routing or SAR frames. Additionally, the HPCP can be used to terminate PPP, PPPoE or PPPoATM traffic on an xDSL line card.

[1158] In terms of voice support, the HPCP can be used in the voice gateway subsystem to terminate VoATM or VoIP; it can also be used for a trunking application on the trunk card to take the narrowband traffic off the TDM backplane and trunk it (AAL2 trunking and/or CES) towards the ATM network.

[1159] A major advantage of using the HPCP in a multi-service access platform application is its versatility in terms of I/O interfaces and protocol support. A system designer can re-use board design, system knowledge and expertise to leverage the HPCP as a networking platform in the access space.

[1160] Exemplary Approaches to the Software in the HPCP

[1161] The software provided for the HPCP (HPCP software) is preferably fully integrated with the HPCP hardware and architecture, highly optimized, and includes complete applications to support the myriad of uses for the HPCP. According to one embodiment, the software developed and sold by, for example, GlobespanVirata, Inc. known as Integrated Software on Silicon (ISOS) (e.g., ISOS version R8.0, etc.) can be run on the HPCP. The ISOS software includes tools and a developmental environment and is well-suited to the HPCP. The HPCP software includes a complete port to various operating systems, such as VxWorks, Linux, OSE (a real-time kernel from Enea Systems), and ATMOS-2 (ATM-Operating System [Virata's proprietary operating system]). The HPCP software may be integrated with other software products, such as for Web management (e.g., the emWeb™ [embedded Web server] management product sold by GlobespanVirata, Inc.), UPnP, security and firewall functions. The HPCP software may be integrated with voice processing software (e.g., the vCore™ voice DSP software sold by GlobespanVirata, Inc.) for voice processing solutions.

[1162] Preferably, the HPCP software combines the software solutions for both the CPP (MIPS) for the control plane and the packet processor for the data plane. The HPCP software may include basic drivers for ATM AAL0, AAL1, AAL2, AAL5, Ethernet, HDLC, UART, Transparent (PCM), SPI and I2C.

[1163] The data applications include support for bridging, such as for spanning tree (802.1d), prioritized bridge (802.1p), Ethernet to Ethernet, and Ethernet to AAL5 (via RFC 1483). The data applications may also include support for routing and IP forwarding (such as RIP [Routing Information Protocol], OSPF [Open Shortest Path First] and MPLS), and for frame relay.

[1164] The HPCP software may include voice applications, such as for VoATM (AAL2 [SS-SAR]). According to one embodiment, the HPCP software is fully integrated with the vCore™ voice DSP software sold by GlobespanVirata, Inc. of Red Bank, N.J. The HPCP voice applications include support for circuit emulation (e.g., CES [Circuit Emulation Services]) and VoIP (e.g., RTP/RTCP in the packet processor and MEGACO, MGCP and SIP [Session Initiation Protocol] in the CPP).

[1165] According to one embodiment, the CPP software package includes a flow manager element. The flow manager element creates applications by linking micro-coded building blocks, is OS (operating system) independent, and provides a convenient API (Application Program Interface) for customers not wishing to use all other CPP software.

[1166] FIG. 81 illustrates the flow manager functionality 1020 according to an embodiment of the invention. As stated above, the HPCP software may be integrated with voice processing software such as, for example, the vCore voice DSP software sold by GlobespanVirata, Inc. for voice processing.

[1167] Development of software for the HPCP may be facilitated through the use of certain data plane development tools. For example, a functional network processor (packet processor) simulator may be employed. GlobespanVirata, Inc. markets a packet processor simulator called Vsim™ which may be employed for this purpose. Vsim™ is a high speed system simulator whose simulation includes the following: packet processor core Instruction Set (IS); functional behavior for DMAs; internal and external memories; and functional level peripherals. Vsim™ provides performance analysis and includes traffic generators. Another data plane development tool that may be employed is Vas™, which is a stand-alone packet processor assembler. Another data plane development tool that may be employed is V-bug™, which is an assembler level debugger. Another data plane development tool that could be employed would be VCC™, a packet processor C compiler. Another data plane development tool is V-GDB™, which is a packet processor C source level debugger (like V-bug™). Each of these tools can be hosted on a Windows NT™ or Sun™ platform. Each of the aforementioned exemplary development tools is marketed by GlobespanVirata, Inc. FIG. 82 illustrates an exemplary data plane development 1030 that could be employed for software development for the HPCP. The Vobla IS simulator 1032 refers to the packet processor simulator. According to another approach, software development could be undertaken using reference platform hardware instead of the simulated modules.

[1168] Specific Strategies for the Software in the HPCP

[1169] Development of software to power the HPCP processor as described herein is well within the skill of the ordinary artisan. Some of the considerations in designing the HPCP software are now discussed. In developing the HPCP software, there are various tradeoffs to consider in providing a software end-product that provides an acceptable balance between performance, robustness, portability, and other factors. For the balance of the discussion in this section, the HPCP includes the packet processor (PP) (or control packet processor [CPP]) and the flexible packet processor referred to as the Vobla or NP (network processor).

[1170] Operating System and Portability

[1171] The main goal of HPCP software is to perform functions in cooperation with the HPCP hardware to enable HPCP/Vobla chips to perform as desired in communication systems. Taking into account the vast diversity of different software embedded platforms currently used in the market of communications processors (VxWorks, Linux, Nucleus, OSE, etc.), it seems reasonable to try to offer sufficient flexibility in the HPCP software package to address different embedded environments and different customer expectations for value-added software components.

[1172] In one manner, the HPCP could be an OEM (Original Equipment Manufacturer) product with a very limited software support package, such as drivers and initialization sequence applications. On the other hand, main embedded software platform providers offer solutions allowing potential customers to choose any preferable platform based on different considerations (e.g., existing code base and experience, performance, value-added components, reference platforms and applications, etc.).

[1173] Balancing these considerations, the goal should be to find those points where the HPCP could be more attractive not only as a more powerful communications processor but also as a more flexible and convenient solution in different environments with more value-added components. One more consideration relates to system performance, which may depend on the particular embedded environment. For many popular embedded platforms (VxWorks, OSE, Linux, etc.), the introduced system overhead (which is usually measured in average system call processing time and interrupt latency) is unacceptable for many applications. This triggers suggestions to use other light dedicated environments (e.g., ATMOS, many home-grown simple monitors). Although the main network processor driving force is moving most or all of the critical data path code to the NP microcode area (including most popular switching, interworking, bridging, routing and forwarding scenarios), the CPP-termination data path still needs to be efficient. Therefore, OS-dependent overheads must be kept to a minimum.

[1174] From the above considerations it is reasonable to formulate the following HPCP SW-to-RTOS (Real-Time Operating System) integration strategy principles:

[1175] (1) HPCP software is to be provided in a portable form that enables easy integration with different existing (and future) embedded platforms.

[1176] (2) HPCP software should meet different customer expectations for value-added components. In other words, it should be possible to offer different levels of support, starting from simple object libraries providing low-level network processor drivers, through source-level packages allowing the generation of different libraries for different customer applications, including glue interfaces for different third-party components, and up to deliveries with more value-added components with different implementations. Exemplary embedded platforms that may be the target for HPCP software integration include: VxWorks, Linux, OSE, CHAOS (a next generation ATMOS), Nucleus, PSOS, and others.

[1177] Configuring microcode applications. One of the innovations of the HPCP software is the placing of the critical data path functions in the NP microcode area. In this case the CPP serves mostly as a control/management plane for those data paths (data flows) created in the NP and acts as the NP flow manager, which represents the look and feel model of the HPCP software. This approach assumes that other software requirements and/or software design decisions should strive to meet the following main goal: NP processing should be as simple and effective as possible, meaning that:

[1178] All data structures (tables, flow contexts, etc.) used by the NP (and possibly shared with the CPP) should be designed to be the most effective from the NP code perspective.

[1179] The NP should blindly perform flow-specific processing by calling different functional blocks. The work of linking (stacking) these NP functional blocks should be done at run-time by the CPP flow manager code when a request for new flow creation comes from the user application and/or control/management plane in the CPP. Such functional stacking is done by proper linkage of flow contexts in the shared RAM. To implement these points, NP data structures are known to the CPP.

[1180] FIG. 83 illustrates an HPCP look and feel model 1040 as described above.

[1181] NP load configuration. Considering the vast diversity of network applications for the targeted market and also the intention to provide an open communications processor architecture (i.e., the ability to program and add custom implementations to the NP microcode area), it is desirable that the NP software load be configurable at compile-time. Configuration files (for setting compile-time parameters) may be set either manually or, alternatively, via, for example, the System-Builder™ tool available from GlobespanVirata, Inc. Each one of the several NPs within an HPCP device may be loaded with a different microcode image.

[1182] Loading microcode. Dynamic NP code reload (i.e., changing the NP code contents during run-time) is not supported. The NP microcode image will be loaded only once at NP reset time and will contain all functionality needed by a particular network device. Note that the NP may be reset by the CPP without a complete system reset occurring. This allows the user to change an NP load, after which the NP is soft-reset.

[1183] Control versus data plane processing. Much of the design of the HPCP software is aimed at extracting critical-path processing from the CPP and executing it in the NP. Critical path processing, in this context, means processing that is performed on virtually all data packets (or cells) on an interface. It varies from one application to another and covers all layers of processing performed on the packet by the HPCP. Therefore, there is a divergence from a strictly layered architecture where the NP performs (for example) layer 2 and 3 and the CPP performs all higher layer processing in favor of a model in which the NP will preferably perform all critical path processing, irrespective of the layers involved (layers 2, 3 and, at times, layers 4 and higher). The CPP, then, will perform all non-critical (or control plane) processing, from layer 2 and up.

[1184] For example, in an OSPF router, the critical path may consist of IP forwarding table lookups, ARP (Address Resolution Protocol) cache table lookups (where successful) and forwarding. Non-critical path functions will include all of the OSPF control plane (learning next-hops, etc.), generating the ARP requests, and handling the ARP responses.

[1185] Network Processor software design approach. Network Processor microcode covering most of the data path processing is a component implemented from scratch in the HPCP SW project, which makes its performance efficiency an important design goal. Other design goals are flexibility, expandability and architectural openness.

[1186] Given the HPCP software look and feel model defined above, an ATIC-like approach could be useful for network processor microcode design. It involves the following concepts.

[1187] Network Processor objects and contexts. The network processor microcode may be divided into functional blocks, which may be operationally joined (e.g., chained) in various combinations by the application builder in order to create different execution paths.

[1188] The concept of an object is introduced to describe a section of code that has a particular state. The object is an instantiation of any entity that executes this code and has its own state information (referred to as its context). The context contains protocol state information, necessary data structures and resources that have been dynamically allocated to the object. For example, an object's context may include a protocol state value, transmit queue of frames, timer information and links to subsequent objects in the execution path.

[1189] The context (i.e., associated data structures) belonging to an object is object-dependent and known only to the object itself. Objects have “next object” pointers and “next function” pointers. The “next object” pointer indicates the object that will be activated after the current object has completely handled its current event (similar to the “this” pointer for the next object in C++ terminology). The “next function” pointer is the address of the routine that the next object will execute.
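By way of illustration only, the chaining of objects through “next object” and “next function” pointers may be sketched as follows. This is a minimal sketch under stated assumptions: the names Context, strip_header, count_bytes and run_chain are hypothetical and do not appear in the HPCP software, and Python references stand in for the addresses that the microcode would hold.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Context:
    """Per-object state plus the links to the next object in the path."""
    name: str
    state: dict = field(default_factory=dict)
    next_object: Optional["Context"] = None  # context of the next object
    next_function: Optional[Callable[["Context", bytes], bytes]] = None  # routine it runs


def strip_header(ctx: Context, frame: bytes) -> bytes:
    """Hypothetical functional block: drop a fixed-size header."""
    ctx.state["frames"] = ctx.state.get("frames", 0) + 1
    return frame[ctx.state["header_len"]:]


def count_bytes(ctx: Context, frame: bytes) -> bytes:
    """Hypothetical functional block: accumulate a byte counter."""
    ctx.state["bytes"] = ctx.state.get("bytes", 0) + len(frame)
    return frame


def run_chain(first: Context, first_function, frame: bytes) -> bytes:
    """Run each object, then follow its next-object/next-function links."""
    ctx, func = first, first_function
    while ctx is not None and func is not None:
        frame = func(ctx, frame)
        ctx, func = ctx.next_object, ctx.next_function
    return frame


# The linking itself is done outside the objects (by the flow manager).
counter = Context("counter")
stripper = Context("stripper", state={"header_len": 2},
                   next_object=counter, next_function=count_bytes)
result = run_chain(stripper, strip_header, b"\x00\x01payload")
```

Each functional block touches only its own context, so the execution path is changed by relinking contexts rather than by changing the blocks themselves.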

[1190] Different contexts for the Rx and the Tx parts of a flow (as is done in the Helium™ communications processor sold by GlobespanVirata, Inc.) may be employed because in most cases Rx and Tx processing are independent. This helps minimize the amount of control data that needs to be transferred within the system.

[1191] Flexible mapping of Network Processor execution threads. In one manner, mapping of functional processing blocks to the network processor's threads (tasks) is performed not based on functional breakdown (i.e., task=protocol entity), but rather based on operational effectiveness.

[1192] With this approach, the network processor task is considered as an abstract operational vehicle capable of performing different functional blocks and/or protocol stack layers depending on the type of message in its input queue.

[1193] In order to optimize incoming message decoding, every message will contain at a pre-defined place (e.g., the first word) the pointer to the routine that will be called to handle the incoming message. This concept, of course, could be used only for network processor tasks having input queues. So-called HW network processor tasks (i.e., related to physical port specific processing) should be hard-coded to some port specific function.
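The handler-in-first-word convention may be sketched as follows. This is an illustrative sketch only: handle_cell, handle_frame and run_task are hypothetical names, and a Python function reference stands in for the routine pointer stored in the first word of the message.

```python
from collections import deque


def handle_cell(payload: bytes):
    """Hypothetical handler for cell-type messages."""
    return ("cell", len(payload))


def handle_frame(payload: bytes):
    """Hypothetical handler for frame-type messages."""
    return ("frame", len(payload))


def run_task(input_queue: deque):
    """Abstract task: no decoding, just call whatever the message names."""
    results = []
    while input_queue:
        message = input_queue.popleft()
        handler, payload = message[0], message[1]  # first word is the handler
        results.append(handler(payload))
    return results


queue = deque([
    (handle_cell, b"\x00" * 53),
    (handle_frame, b"\x00" * 128),
])
results = run_task(queue)
```

Because the task never inspects a message type, the same task can serve any functional block whose handler is placed in the message.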

[1194] Task boundaries will break the continuous execution of a flow, but these do not necessarily need to coincide with protocol (or layer) boundaries. In general, these breaks in a flow should be avoided unless functionally required since they add overhead. For example, in configurations involving a few different physical ports and/or networking applications, dedicated tasks empty/fill the serial port's FIFOs in order to guarantee low latency, while other tasks run application code which does not have such hard real-time requirements.

[1195] Memory allocation/handling approach. Port level contexts are preferably allocated in internal network processor SRAM (for the sake of effectiveness and also because their number is limited by the physical chip configuration, which allows static preallocation), while all other data structures (connection level contexts and lookup tables) are stored in external SDRAM and allocated dynamically.

[1196] Certain structures (e.g., lookup tables) may be partially located in internal and external memory spaces or configured to reside in either one or the other.

[1197] Memory allocations in both the network processor's SRAM and the external SDRAM are performed by the CPP. The network processor recognizes SRAM partitioning either via compile-time definitions (initialization is done by the CPP, which initializes the memory data structures for the NP's tasks and for the different protocols) or via pointers in a well-known area filled in at run-time by the CPP's SRAM manager.

[1198] Context and lookup data allocated dynamically in external SDRAM are processed by the network processor code after DMA'ing this data (only the needed part of it) to special areas in the network processor's internal SRAM. Buffer area for this data in SRAM is to be reserved in a per-task scratchpad area, which means that for abstract tasks (i.e., tasks that are not oriented to some particular processing), the scratchpad area should be allocated to be big enough to fit the maximum size of the context data being processed.

[1199] In one manner, only one copy of any context data should exist in SRAM at any given time. It is assumed that all context data is always copied to a fixed offset in the task's scratchpad and that there is a one-to-one correspondence between any context data field and the network processor task dealing with it (i.e., any data field is to be processed by only one network processor task).

[1200] At the same time, there could be considerable flexibility in context data processing with the goal of gaining processing effectiveness. For example, context data may be subdivided into sub-blocks where data in a sub-block is grouped based on a common processing principle: a few fields of the context are grouped together to be DMA'd at one time (in one shot) when all/most of these fields are to be in use by a specific functional block. On the other hand, for example, specific statistics counters in the context could be read-modify-written only when the need arises (at the end of a PDU [Packet Data Unit] or upon an error). This allows processing of different context sub-blocks by different network processor tasks. Of course, this approach makes context data design more difficult. FIG. 84 illustrates the network processor software design approach 1050 for an AAL5 receiver flow example.

[1201] Timers in Network Processor. A CPP-based timer service may be employed via the network processor-to-CPP command interface (especially when the needed timers are long and are started/used rarely). Whenever possible, the internal free-running timer may be used for time-stamping of different events (e.g., to recognize reassembly timeouts). In this case, instead of getting a timer expiration event, a delta between the current free-running timer value and the previous timestamp is calculated every time (each timer event) and a timer expiration event is generated where needed locally, without any message passing.
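The timestamp-delta technique may be sketched as follows. This is a minimal sketch that assumes a 32-bit free-running counter (the counter width is not stated above); elapsed_ticks and has_expired are hypothetical names. The masked subtraction keeps the delta correct when the counter wraps around.

```python
TIMER_BITS = 32  # assumption: width of the free-running timer
TIMER_MASK = (1 << TIMER_BITS) - 1


def elapsed_ticks(now: int, timestamp: int) -> int:
    """Delta between the current timer value and an earlier timestamp."""
    return (now - timestamp) & TIMER_MASK


def has_expired(now: int, timestamp: int, timeout_ticks: int) -> bool:
    """Local expiration check; no timer message is exchanged."""
    return elapsed_ticks(now, timestamp) >= timeout_ticks


# A timestamp taken just before the counter wrapped still gives a sane delta.
delta_across_wrap = elapsed_ticks(5, 0xFFFFFFFB)
```

The check is valid only while the true elapsed time is shorter than one full counter period, which is one reason long, rarely used timers may be better served by the CPP-based timer service.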

[1202] CPP software design approach. The CPP software design goals may include the following:

[1203] (1) A simple and convenient API should be designed allowing easy integration of CPP software with both different RTOS platforms and third party products while using thin SW shims.

[1204] (2) Maximum possible reuse of the existing control/management plane code base should be sought. This may entail the introduction of a new simple SW shim and/or some restructuring of existing SW (i.e., the existing GlobespanVirata ISOS code).

[1205] (3) The ISOS-ATIC convergence program and principles are to be considered when decisions about code base choice are made.

[1206] The aforedescribed look and feel model of HPCP software, having the CPP SW function as the NP flow manager, has the following consequences.

[1207] CPP control and data API considerations. A control API may be provided for creation/deletion of NP flows and change/query of their attributes. This API is to be used mostly by user applications, but also (e.g., through a shim) by control/management plane SW (e.g., by a signaling protocol and/or an SNMP [Simple Network Management Protocol] agent).

[1208] It may be desirable to provide a generic control API with a minimal and fixed set of control primitives (e.g., similar to the so-called ISOS White interface). According to this approach, a flow of any type (including any future type) may be created/deleted using the same control primitive (e.g., FLOW_CREATE) while flow type and other attributes are provided as primitive parameters. Flow attribute change/query may be handled via generic primitives (e.g., FLOW_GET/FLOW_SET).

[1209] Passing the flow type and attributes as a text string parameter of the FLOW_CREATE primitive seems to meet the requirement of API generality, flexibility and expandability.
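One possible way to decode such a parameter string is sketched below. This is an assumption-laden illustration: parse_flow_params is a hypothetical helper, the “/Name=Value” syntax is inferred from the exemplary strings in this description, and treating attribute names as case-insensitive is an assumption.

```python
def parse_flow_params(param_string: str) -> dict:
    """Split '/Name=Value/Name=Value' into a dictionary of attributes."""
    params = {}
    for item in param_string.split("/"):
        if not item:
            continue  # skip the empty piece before the leading '/'
        key, separator, value = item.partition("=")
        if not separator:
            raise ValueError(f"attribute without '=': {item!r}")
        params[key.strip().lower()] = value.strip()  # assumed case-insensitive
    return params


aal5_params = parse_flow_params("/Type=AAL5/TxVci=5/TxVpi=0/Pcr=100/OamF5=Yes")
```

Because new flow types and attributes only add new names to the string, the primitive signature itself need not change when the set of supported flows grows.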

[1210] The FLOW_CREATE primitive can both create the data path protocol layer components and also link them together in different ways. Also, it is desirable to have primitive syntax traceable to protocol specifications, which makes its usage easier. It is feasible to start the needed control plane component implicitly while processing FLOW_CREATE primitives when proper parameters are supplied in the parameter string. Another requirement concerns the possibility of access to various layers/components created/linked by the FLOW_CREATE primitive, because the same protocol components could be involved in different flows.

[1211] For linkage of previously created termination flows in interworking/bridging/routing applications, special primitives (FLOW_LINK, FLOW_UNLINK) may be employed.

[1212] Implementation of the FLOW_CREATE primitive for a specific data path protocol component (e.g., the CPP driver activated for a particular flow type) can also be provided in the CPP data path processing, transparently to the upper application, if the proper network processor microcode block is not yet available.

[1213] There may be a data API provided as well for passing termination data to/from the NP. This API may be used both by user applications and the control/management plane SW. Receive termination and transmit confirmation are bound via a standard call-back technique.

[1214] The goal that the NP code be simple, small and effective means that the CPP driver software activated via the control API for NP flow creation/deletion/alteration/query must recognize the flow context internal structure (even though this contradicts a strict object-oriented approach). However, this is useful because it allows both effective flow building/removing without NP interaction and also permits easy integration with MIBs (Management Information Bases). One consequence is that versions of the CPP and NP code should match exactly and should be tightly linked to each other.

[1215] Linking of NP flows by the CPP assumes that the CPP knows the addresses of the NP functional blocks, which are inserted as next function pointers in contexts. This could be achieved when the CPP load is built using symbol information of the previously built NP load. However, a difficulty arises when multiple NPs (e.g., with different functionalities) are served by the same CPP. Thus, some mapping (flow type to function block address) is needed that is specific for each NP. This could be implemented using a mapping array created in the NP internal SRAM during NP initialization, which is then read by the CPP to retrieve flow linking information.
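The per-NP mapping may be pictured with the following sketch. All names and addresses are hypothetical placeholders: NP_FUNCTION_MAP stands in for the mapping arrays that each NP would publish in its internal SRAM, and a dictionary stands in for a flow context.

```python
# Hypothetical per-NP mapping arrays; the addresses are made-up placeholders.
NP_FUNCTION_MAP = {
    0: {"AAL5": 0x0400, "Rfc1483": 0x0520},
    1: {"AAL5": 0x0610, "Ethernet": 0x0700},
}


def lookup_function_address(np_index: int, flow_type: str) -> int:
    """Resolve a flow type to the functional block address on one NP."""
    np_map = NP_FUNCTION_MAP[np_index]
    if flow_type not in np_map:
        raise LookupError(f"NP {np_index} has no functional block for {flow_type}")
    return np_map[flow_type]


def link_context(context: dict, np_index: int, next_flow_type: str,
                 next_context_address: int) -> dict:
    """Fill in the next-object and next-function pointers of a context."""
    context["next_object"] = next_context_address
    context["next_function"] = lookup_function_address(np_index, next_flow_type)
    return context


aal5_context = link_context({}, 0, "Rfc1483", 0x8000_1000)
```

The same flow type resolves to a different address on each NP, which is why the CPP consults the table of the NP that owns the flow rather than a single build-time symbol.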

[1216] The knowledge about internal flow context structure should still be localized in the particular CPP driver responsible for specific flow manipulation. Additionally, care should be taken while updating context data shared by the CPP and the NP. The objective is simple: for every field it is desirable to have only one write owner operating without memory locking. If this is not possible, the CPP-to-NP command interface is to be used to pass a write request to the write access owner of the data. Also, additional means may exist to ensure that both the CPP and the NP code view or recognize a context structure in the same way. This may involve various checks of the compatible software loads used in the CPP and the NP.
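The single-write-owner rule may be sketched as follows. This is only an illustration: the field names, FIELD_OWNER table and helper functions are hypothetical, and a deque stands in for the CPP-to-NP command interface.

```python
from collections import deque

# Hypothetical ownership table: exactly one writer per context field.
FIELD_OWNER = {"rx_packets": "NP", "tx_enable": "CPP"}
context = {"rx_packets": 0, "tx_enable": False}
command_queue = deque()  # stands in for the CPP-to-NP command interface


def cpp_write(field_name: str, value) -> str:
    """CPP writes its own fields directly and forwards all others."""
    if FIELD_OWNER[field_name] == "CPP":
        context[field_name] = value
        return "written"
    command_queue.append((field_name, value))  # request for the write owner
    return "queued"


def np_drain_commands() -> None:
    """The NP, as write owner, applies the queued requests itself."""
    while command_queue:
        field_name, value = command_queue.popleft()
        context[field_name] = value


direct = cpp_write("tx_enable", True)
context["rx_packets"] = 42          # NP updates its own counter
forwarded = cpp_write("rx_packets", 0)
np_drain_commands()
```

No lock is needed because each field is only ever modified by its owner; the other core sees the change after the owner applies it.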

[1217] The same approach as outlined above is to be adopted for the various look-up tables used/updated by both the CPP and the NP. These tables may be handled by the control/management plane software in the CPP. There may be no need to introduce a special API for the table update in the CPP. Alternatively, there may be some table-specific driver code which knows the particular table structure (chosen to be more effective from the NP perspective) and which is activated (via a SW shim) from the control/management plane software. Again, care should be taken in implementing table update operations if a table could be changed from both cores, as well as in the case when the table update is a complicated operation requiring a set of changes in different places/entries.

[1218] Control and data API proposal. The following exemplary API meets the above design functionality and could be used as a basis for further design decisions:

[1219] NewFlowHandle=FLOW_CREATE

[1220] (ExistingFlowHandle, /type=FLOW_TYPE/param=PARAM);

[1221] status=FLOW_DELETE(ExistingFlowHandle);

[1222] status=FLOW_TRANSMIT(ExistingFlowHandle, Frame);

[1223] status=FLOW_SET(ExistingFlowHandle, attribute_name,attribute_value);

[1224] status=FLOW_GET(ExistingFlowHandle, attribute_name,&attribute_value);

[1225] status=FLOW_LINK(ExistingUpperFlow, ExistingTerminationFlow);

[1226] status=FLOW_UNLINK(ExistingUpperFlow, ExistingTerminationFlow);

[1227] Enabling and disabling of flows in the Tx and/or the Rx directions could be implemented through a FLOW_SET primitive with proper attributes (e.g., TxEn, TRUE), which also could be provided in the FLOW_CREATE parameter string.

[1228] Starting of the control plane component may be initiated via the same FLOW_CREATE (or FLOW_SET) primitive. For example, creation of an AAL5 termination connection while starting the corresponding OAM F5 process could be as follows:

[1229] AtmPortHandle1=FLOW_CREATE(VoblaId+PhysicalPortNumber, /Type=UTOPIA/Phy=0/Name=A1)
Aal5Handle=FLOW_CREATE(AtmPortHandle1, /Type=AAL5/TxVci=5/TxVpi=0/Pcr=100/OamF5=Yes)
FLOW_SET(Aal5Handle, RxHandler, 0xADDRESS0)
FLOW_SET(Aal5Handle, TxConfirmationHandler, 0xADDRESS1)

[1230] The following example demonstrates the creation of a bridge application over one Ethernet port and two RFC 1483 encapsulated AAL5 connections created on different network processors. An IP termination flow is multiplexed on one of the AAL5 VCIs, a spanning tree process is started as the control plane of the bridge application, an OAM F5 flow is started for the other AAL5 VCI, and ILMI is initiated for one of the ATM ports.

[1231] EthernetPortHandle=FLOW_CREATE(Vobla1+PhysicalPort2, /Type=Ethernet/Promisc=Yes)
BridgeHandle=FLOW_CREATE(EthernetPortHandle, /Type=Bridge/Spanning=Yes)
AtmPortHandle1=FLOW_CREATE(Vobla1+PhysicalPort1, /Type=UTOPIA/Phy=1)
Aal5Handle1=FLOW_CREATE(AtmPortHandle1, /Type=AAL5/TxVci=5/TxVpi=0/Pcr=10000)
Rfc1483Handle1=FLOW_CREATE(Aal5Handle1, /Type=Rfc1483)
IpHandle1=FLOW_CREATE(Rfc1483Handle1, /Type=Ip/IpAddr=10.0.0.1/Mask=255.0.0.0)
FLOW_SET(IpHandle1, IpRxHandler, 0xADDRESS0)
LanHandle1=FLOW_CREATE(Rfc1483Handle1, /Type=Ethernet)
AtmPortHandle2=FLOW_CREATE(Vobla2+PhysicalPort3, /Type=UTOPIA/Phy=5/Ilmi=Yes)
Aal5Handle2=FLOW_CREATE(AtmPortHandle2, /Type=AAL5/TxVci=20/TxVpi=1/Pcr=10000)
FLOW_SET(Aal5Handle2, OamF5, Yes)
Rfc1483Handle2=FLOW_CREATE(Aal5Handle2, /Type=Rfc1483)
LanHandle2=FLOW_CREATE(Rfc1483Handle2, /Type=Ethernet)
FLOW_LINK(BridgeHandle, LanHandle1)
FLOW_LINK(BridgeHandle, LanHandle2)
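To make the handle-based stacking concrete, a toy in-memory model of the control primitives is sketched below. It is not the HPCP flow manager: the FlowManager class, its integer handles and its storage layout are assumptions made for illustration, and no network processor is involved.

```python
import itertools


class FlowManager:
    """Toy model: each flow records its parent handle, attributes and links."""

    def __init__(self):
        self._next_handle = itertools.count(1)
        self.flows = {}

    def flow_create(self, parent, params: str) -> int:
        # '/Type=AAL5/TxVci=5' -> {'Type': 'AAL5', 'TxVci': '5'}
        attrs = dict(p.split("=", 1) for p in params.strip("/").split("/"))
        handle = next(self._next_handle)
        self.flows[handle] = {"parent": parent, "attrs": attrs, "links": []}
        return handle

    def flow_set(self, handle: int, name: str, value) -> int:
        self.flows[handle]["attrs"][name] = value
        return 0

    def flow_get(self, handle: int, name: str):
        return self.flows[handle]["attrs"][name]

    def flow_link(self, upper: int, termination: int) -> int:
        self.flows[upper]["links"].append(termination)
        return 0

    def flow_unlink(self, upper: int, termination: int) -> int:
        self.flows[upper]["links"].remove(termination)
        return 0


fm = FlowManager()
eth_port = fm.flow_create("Vobla1+PhysicalPort2", "/Type=Ethernet/Promisc=Yes")
bridge = fm.flow_create(eth_port, "/Type=Bridge/Spanning=Yes")
atm_port = fm.flow_create("Vobla1+PhysicalPort1", "/Type=UTOPIA/Phy=1")
aal5 = fm.flow_create(atm_port, "/Type=AAL5/TxVci=5/TxVpi=0/Pcr=10000")
rfc1483 = fm.flow_create(aal5, "/Type=Rfc1483")
lan = fm.flow_create(rfc1483, "/Type=Ethernet")
fm.flow_link(bridge, lan)
```

Each handle returned by a create call becomes the first argument of the next one, so the protocol stack is expressed purely by the order and nesting of the calls.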

[1232] CPP API thread safety. Both control and data termination APIs in the CPP may be represented as a passive library (possibly provided in binary form as a part of the platform specific BSP) handling primitives from the user/control/management SW. These APIs should be thread safe and also should provide effective separation of control and data primitive flows. This avoids the scenario where processing of a termination data primitive is delayed because of control primitive handling. An ATIC-like vertical thread optimization model can help to solve such problems, and, in this case, API functions could be implemented as wrappers that cause message sending where needed.

[1233] CPP system software base. The goal of supporting a vast diversity of different RTOS platforms suggests the use of ATIC system services and the ATIC RTOS porting technique as a system base for CPP software development.

[1234] This approach is further desirable because ATIC system services have also been chosen as the preferred base for the ATIC-to-ISOS convergence strategy.

[1235] Due to the high degree of similarity, the ISOS BUN framework could be reused as the CPP API implementing framework, perhaps with a few changes. This conceivably may allow the reuse of existing BUN drivers and the same legacy peripheral ports for re-implementation on the network processor.

[1236] HPCP Software Partitioning

[1237] The goal of this section is to characterize the HPCP software partitioning as more or less independent blocks while trying to roughly define:

[1238] Functional specification of every block.

[1239] Interfaces between blocks and interfaces to outer world (external) software.

[1240] Strategy and estimation of possible software reuse and the definition of any needed shims.

[1241] The guiding principles used for software partitioning are the design approach defined in the previous discussion and the traditional information hiding approach.

[1242] CPP software partitioning. FIG. 85 illustrates suggested partitioning and interfaces. According to an embodiment, the functional blocks and interfaces of FIG. 85 are provided as follows. A first set may correspond to user or third party components. This first set may include the following blocks in FIG. 85: user application 1070, socket interface 1072, control plane software 1074, management plane software 1076, file system 1078, and console 1080. A second set may correspond to new components created for the HPCP. This second set may include the following blocks in FIG. 85: BSP 1082, Flow manager framework 1084, Functional driver 1086, Lookup table manager 1088, Vobla RAM loader and initializer 1090, Vobla SRAM manager 1098, Vobla queue interface 1092, Shims 1-5, Tracers and diags extension 1094, and Vobla frames/cells 1096. A third set may correspond to existing (e.g., ATIC/ISOS) components. This third set may include the following blocks in FIG. 85: Network interface 1100 (between the Socket interface and Flow manager framework) and System services and OS porting 1102 (above Tracers, diags extension).

[1243] Software Block Functional Specification.

[1244] Flow Manager Framework

[1245] This Flow Manager Framework block 1084 implements the network processor Flow Manager API and provides the framework and services (attribute parsing and registration, data path stacking, etc.) for functional drivers. This component should also deal with API thread safety mechanisms, control and data thread separation, and message sending, wrapping, and queuing, as needed.

[1246] Shim 1: Flow Manager-to-Control Plane and Flow Manager-to-Management Plane.

[1247] The control plane software to be supported may entail the use of a set of shim layers for different control plane implementations. The purpose of Shim 1 is to provide for translation of connection creation/deletion primitives from the control plane to the network processor flow creation/deletion primitives, and also to connect the control plane to the flow termination data path. The same may be done for different management plane implementations as well. For management plane integration this shim also provides mapping of MIB GET/SET methods to the proper FLOW_GET/FLOW_SET calls.

[1248] Functional Driver Blocks

[1249] The number of different supported functional drivers may depend on the number of supported network protocols/applications. A particular driver is responsible for implementation of flow create/delete primitives for flows of a particular type and also for linkage of flows. Termination data path functionality should be provided for all drivers primarily as a general service of the Flow Manager Framework.

[1250] The functions of the driver include:

[1251] Low level serial port initialization/deinitialization while processing port level flow creation/deletion primitives.

[1252] Allocation and initialization/deallocation and deinitialization of port level static contexts in internal SRAM (via services of the network processor SRAM manager) and lookup tables in external SDRAM (or internal SRAM when so requested) while processing port level flow creation/deletion primitives.

[1253] Allocation/deallocation of connection level contexts in external SDRAM and their initialization/deinitialization as a result of connection level flow creation/deletion primitive processing.

[1254] Linkage/delinkage of flows by setting next and next_function pointers in the proper contexts and lookup tables as a result of flow create/delete/link/unlink primitive processing, using the flow_type-to-function mapping provided via services of the network processor SRAM manager.

[1255] Implementation of driver specific FLOW_SET/GET primitives, particularly, create/start control plane protocols when possible and so requested through attributes of FLOW_CREATE and FLOW_SET primitives.

[1256] Implementation of data flow fragments for which microcode is not yet ready, for example, for the AAL2 termination path. The SSSAR (Service Specific Segmentation and Reassembly) sublayer may be implemented by a functional driver in the CPP if a microcode solution does not exist.

[1257] Lookup Table Manager 1088 and Shim 2

[1258] The Lookup Table Manager 1088 manages the modification of lookup tables of particular types and, accordingly, it recognizes or knows the internal table structure (optimized for network processor microcode usage). For various control/management plane components, Shim 2 glue layers (which may be specific for each particular implementation) are provided to implement access to the tables. Instead of providing a generic API, every particular control/management plane component may be restructured to be operable with the network processor's lookup tables using a specific Shim 2 layer. When the lookup table is allocated in SRAM, the network processor SRAM Manager 1098 services are used for accessing the lookup table. When the network processor is a table write owner, modification of the table is done by sending command messages through the network processor Queue Interface 1092 (discussed below).

[1259] Network Processor Queue Interface 1092

[1260] The network processor Queue Interface 1092 is responsible for the CPP-to-network processor interface. This component performs interface polling and/or interrupt processing, as well as the handling of messages going to/from the queues on the interface and routing them to proper recipients.

Network Processor SRAM Manager 1098

The network processor SRAM Manager 1098 coordinates all SRAM allocations and per-network processor task SRAM partitioning and initialization. This component provides flow_type-to-microcode_function mapping functionality. It also may initialize all needed mapping information for access to different agents on the network processor rings by learning the results of the ring enumeration process (discussed previously).

[1261] Network Processor RAM Loader and Initializer 1090 and Shim 3

[1262] The network processor RAM Loader and Initializer 1090 is responsible for the process of network processor image loading and handshaking with the network processor starting code. Through different Shim 3 implementations, the network processor RAM Loader and Initializer 1090 interfaces with different file system components to get the network processor image for loading into the proper network processor.

[1263] System Services and OS Porting 1102

[1264] According to one approach, ATIC system services and the OS porting technique are to be used. Additionally, network processor-specific frame/cell re-implementation is to be undertaken. It is desirable to extend existing ATIC tracing/diags support to produce a more generic and convenient framework. Such a framework will allow activation at both compile-time and run-time for tracing of events registered by different components both in the network processor and the CPP. For example, based on the suggested design approach for the network processor and the CPP Flow Manager Framework, various tracers/injectors may be dynamically linked inside the data path between any of its flow fragments (e.g., similar to trace/debug BUN drivers).

[1265] Network Interface 1100 and Shim 4

[1266] The Network Interface (NI) 1100 connects the termination data path to/from the Flow Manager with the native IP stack. Shim 4 is used for existing NI implementations for primitive translation.

[1267] Shim 5

[1268] Shim 5 is defined to connect the existing console implementations with the Flow Manager FLOW_GET/SET interface.

[1269] BSP 1082

[1270] According to one approach, it is desirable to reuse an existing BSP 1082 for a similar chip (i.e., a chip with a MIPS core). This may impose additional requirements for reference board design. In that case, it might be feasible to reuse some of the BSP components (e.g., flash drivers, memory initialization, etc.). At the same time, the main BSP function (i.e., to provide basic connectivity, typically for UART and Ethernet/IP connections) is to be reimplemented in the network processor. This might entail delivering, as part of a BSP, a simple network processor image containing UART and Ethernet/IP support and the needed CPP drivers. In this case, the network processor image is a part of the CPP load on flash that is loaded to the network processor via the network processor RAM Loader and Initializer during system initialization. Further, if a particular end-user gets the appropriate tools for customized network processor load building, this task should be a part of the BSP building process (UART+Ethernet/IP support should be selected). According to another approach, a BSP with JTAG (a serial debug port)-based connectivity with the target could be employed. In this case, the combined CPP plus network processor(s) image can be viewed as the usual application load build.

[1271] CPP drivers should be integrated (through the Flow Manager Framework and the proper shim) with the particular BSP driver framework.

[1272] Network Processor software partitioning. The goal of the network processor software partitioning approach may be to have a maximum reuse of common code/algorithms while preserving processing efficiency by using inlining and/or macros in the coding practice. FIG. 86 provides one possible partitioning approach 1200 for the network processor.

Performance estimates for RFC 1483 bridging. By way of example, a performance estimate for RFC 1483 bridging can be computed as follows in Table 43.

TABLE 43
RFC 1483 Bridging Performance Estimate

Receive 128 byte Ethernet back to back frames
  Receive frame - 120 cycles
802.1d - Ethernet bridging
  Bridge learning process - 50 cycles
  Enet address lookup - 50 cycles
Optional QoS support
  QoS decision via IP classification - 500 cycles
    (per IP src/dst, src/dst port numbers and protocol id)
Forward to transmit object - 10 cycles
Transmit side operations
  Append 1483 encapsulation header - 10 cycles
  Optional QoS support
    AAL5 queue scheduling - 35 cycles
    RED - 15 cycles
  AAL5 segmentation and transmit - 320 cycles (100+100+120)
General overhead (inter-task msgs, etc.) - 50 cycles

Total processing = 619 cycles (@ 200 MHz = 323K pps)
With QoS support = 1169 cycles (@ 200 MHz = 171K pps)
Wire speed (full duplex) = 2*100M/(8*128) = 200K pps
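The conversion from cycles per packet to packets per second in Table 43 may be reproduced as follows. This is an illustrative calculation only; the helper name packets_per_second is hypothetical, and the cycle totals are taken as stated in the table rather than re-summed from its line items.

```python
CLOCK_HZ = 200_000_000  # 200 MHz, as stated in Table 43


def packets_per_second(cycles_per_packet: int, clock_hz: int = CLOCK_HZ) -> float:
    """Packets processed per second at a given per-packet cycle cost."""
    return clock_hz / cycles_per_packet


base_pps = packets_per_second(619)    # total processing, as stated
qos_pps = packets_per_second(1169)    # total with QoS support, as stated

# Full-duplex 100 Mbit/s with 128-byte frames, using the table's formula.
wire_pps = 2 * 100_000_000 / (8 * 128)

print(f"base: {base_pps / 1000:.0f}K pps")
print(f"with QoS: {qos_pps / 1000:.0f}K pps")
print(f"wire speed: {wire_pps / 1000:.1f}K pps")
```

Under these figures the estimate without QoS exceeds the full-duplex wire rate for 128-byte frames, while the estimate with QoS support falls below it.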

[1273] Executing Branch Instructions Based on an Accumulative Condition Flag

[1274] As discussed previously, in at least one embodiment, an accumulative condition flag, i.e., sticky bit, is used by the HPCP and/or network processor to execute branch instructions. A conventional processing device commonly performs a branching operation by pairing a compare instruction with a branch instruction. More specifically, such a processing device commonly performs the compare operation by subtracting a first specified operand from a second specified operand. As a result of this operation, the processing device sets various condition flags. Such flags provide information regarding the magnitude of the first operand relative to the second operand, as well as other information regarding the operation. The subsequent branch instruction provides a branch in program execution on the basis of the values of the condition flags. The condition flags are typically overwritten based on the next instruction executed by the processing device. Hence, the programmer will typically include the branch instruction directly subsequent to a relevant compare instruction.

[1275] A typical program may contain a complex series of such pairings of compare and branch instructions. FIG. 87 illustrates the execution of such a program 1400. In step 1402, the processing device executes a first compare instruction (i.e., the compare1 instruction). As mentioned above, in this step, a first operand is subtracted from a second operand. The processing device also sets condition flags on the basis of the outcome of the comparing operation. Subsequently, in step 1404, the processing device executes a branch instruction on the basis of the values of the condition flags. That is, if the condition flags contain prescribed values, the processing device advances to a specified branch address. In the illustrated case of FIG. 87, the processing device branches to address A if the compare1 instruction satisfies prescribed conditions, as reflected by the values of the condition flags.

[1276] As shown, the program 1400 contains multiple additional pairings of compare and branch instructions. For instance, in step 1406, the processing device performs a second comparison operation (i.e., the compare2 instruction). The processing device also resets the condition flags on the basis of the outcome of the second comparing operation. In step 1408, the processing device executes a branch instruction on the basis of the new values of the condition flags. Namely, the processing device branches to address B if the compare2 instruction satisfies prescribed conditions, as reflected by the value of the condition flags.

[1277] In step 1410, the processing device performs a third comparison operation (i.e., the compare3 instruction). Again, the processing device also resets the condition flags on the basis of the outcome of the comparing operation. In step 1412, the processing device executes a branch instruction on the basis of the new values of the condition flags. Namely, the processing device branches to address C if the compare3 instruction satisfies prescribed conditions, as reflected by the value of the condition flags.

[1278] Yet additional pairings of compare and branch instructions may be included (although not illustrated). Following the series of compare and branch instructions, the program may include additional processing 1414.

[1279] The known technique shown in FIG. 87 may be applied in numerous applications, such as in performing error check operations. For example, a network processor often performs a series of error checks prior to performing a prescribed main processing task. In the IPv4 packet network protocol, for instance, the network processor checks to determine whether the protocol version of information being processed is equal to 4. The processing device may also determine whether the header of the information being processed is at least five words. The processing device may also determine whether the total length of the packet of information is not greater than the length specified by the MAC layer.

[1280] The processing device may assign a different pair of compare and branch instructions to each of the above requirements, as indicated in Table 44.

TABLE 44
Instruction Index    Action
1                    compare1
2                    branch if “not equal” to error1
3                    compare2
4                    branch if “less equal” to error2
5                    compare3
6                    branch if “greater than” to error3
7-n                  additional processing

[1281] The first and second instructions identified correspond to steps 1402 and 1404 of FIG. 87. The third and fourth instructions correspond to steps 1406 and 1408 of FIG. 87. The fifth and sixth instructions correspond to steps 1410 and 1412 of FIG. 87. The indicated additional processing in steps 7 et seq. corresponds to step 1414 of FIG. 87.

[1282] The technique described above has shortcomings. Namely, the proliferation of branch instructions in a program reduces the efficiency of the processing device. For instance, each of the branch instructions takes a prescribed amount of time to perform. Thus, a program that includes a multitude of such instructions may suffer from processing delays. Further, a lengthy program comprising several compare and branch instructions also requires sufficient memory capacity to store the program, and therefore detracts from efforts to deploy the processing device in computationally sparse technical environments.

[1283] Further, in the above-noted IPv4 application, the processing device may encounter the above-described error conditions relatively infrequently. In this sense, these conditions are considered rare. Nevertheless, the processing device must sequence through the above-described six error checking instructions before advancing to the main processing routine (e.g., in step 1414 of FIG. 87). In view of these factors, the use of multiple branching instructions appears to impose an unwarranted bottleneck in the course of normal processing of IPv4 data. For all of the above reasons, the use of branch instructions is considered expensive to a design implementation.

[1284] The apparatus and method described herein are applicable to any type of processing environment. For example, FIG. 88 provides one such general processing environment 1500 for the purposes of illustration. The environment 1500 includes a processing device 1502, including a central processing unit (CPU) 1504. The processing device 1502 may also include other conventional processing units coupled to the processing unit 1504, such as memory 1508, cache 1506, and communication interface 1510. The CPU 1504 serves as a central engine for executing machine instructions. The memory 1508 (such as a Random Access Memory, or RAM) and cache 1506 serve the conventional role of storing program code and other information for use by the processing unit 1504 in performing its ascribed functions. The communication interface 1510 serves the conventional role of interacting with external equipment, such as the network 1516, or some other peripheral device.

[1285] The processing device 1502 also includes program functionality 1512 for executing various processing functions. This program functionality 1512 may be implemented as software stored in memory (e.g., memory 1508, or some other memory). As indicated in FIG. 88, the program functionality 1512 may include one or more programs 1514 that are specifically designed to make use of the unique branching technique of the present invention, to be described in greater detail below.

[1286] The processing device 1502 may include additional hardware and/or software to serve specific computational roles. For instance, the processing device 1502 may comprise an apparatus having hardware and/or software functionality specifically adapted for communication with a packet network, such as network 1516. For instance, the packet network 1516 may comprise any type of local-area or wide-area network for transmitting data in packet format. More specifically, the packet network 1516 preferably comprises some type of network governed by the TCP/IP protocol, such as the Internet, or an intranet. The network may include any types of physical links, such as fiber-based links, wireless links, copper-based links, etc.

[1287] FIG. 89 provides additional details regarding an exemplary architecture of the processing unit 1504. The processing unit 1504 may include an arithmetic logic module (ALU) 1602, a control logic module 1604, input/output (I/O) logic module 1606, and various working registers 1608.

[1288] The control logic module 1604 includes logic for decoding and executing machine instructions. To this end, this module 1604 may include conventional features, such as an instruction register for holding an instruction while it is being processed by the processing device 1502, a program counter, etc. The control logic module 1604 may further include one or more storage locations 1630 for storing condition flags. As described above, the processing device 1502 modifies the contents of the condition flags when an instruction is performed by the processing device 1502, so as to indicate the outcome of the instruction. Different processing devices designed by different manufacturers employ different sets of condition flags. Known flags include an SF flag, which is equal to the MSB (most significant bit) of the result of an operation, indicating whether the result was negative or non-negative. A ZF flag is set to 1 if the result of an operation is 0. A CF flag is set to 1 if the result of an operation produces a carry. Still other types of flags are known to those skilled in the art.

[1289] In addition, the solution described herein provides at least one additional condition flag referred to as an accumulative flag 1632. Unlike the other flags, the accumulative flag 1632 may provide a value that reflects the outcome of more than one instruction. For instance, after a sequence of three compare instructions, the accumulative flag may be set to indicate whether any of these three instructions satisfies its pre-established condition. In other words, the accumulative flag 1632 in this case represents the logical OR of the outcomes of the separate compare instructions. The flag is referred to as accumulative in the sense that its final status reflects the accumulation of separate determinations made in separate compare instructions (or other instructions). It is also appropriate to refer to this flag as a sticky flag. The flag is sticky in the sense that it can remain set for multiple computer instructions (such as multiple compare instructions). That is, unlike the known art, the accumulative (or sticky) flag 1632 does not necessarily change after every computer instruction (such as after every compare instruction). Additional details regarding the use of the accumulative flag are presented below.

[1290] The flags stored in storage 1630 may comprise binary information expressed in one or more bits. The storage 1630 may contain a single accumulative flag, or multiple accumulative flags.

[1291] The ALU 1602 performs various logical and arithmetic operations in a conventional manner. The I/O logic 1606 coordinates transfer of information between the processing unit 1504 and other modules in the environment 1500 in a conventional manner. The working registers 1608 retain information for use in the execution of program instructions, and may include various conventional address registers and arithmetic registers.

[1292] FIG. 90 describes an exemplary method for executing program instructions based on the value of the accumulative flag. It begins in step 1702, where the processing device executes a first compare instruction (i.e., the compare1 instruction). As mentioned above, in this step, a first operand is subtracted from a second operand. The processing device also sets the value of the accumulative flag to reflect whether the compare1 instruction satisfies a first prescribed condition. In step 1704, the processing device executes a second compare instruction (i.e., the compare2 instruction). The processing device also updates the value of the accumulative flag to reflect whether either the compare1 instruction satisfies the first prescribed condition, or the compare2 instruction satisfies a second prescribed condition. In step 1706, the processing device executes a third compare instruction (i.e., the compare3 instruction). The processing device also updates the value of the accumulative flag to reflect whether any of the compare1, compare2, or compare3 instructions satisfies its respective prescribed condition. Yet additional compare instructions may be included (although not illustrated).

[1293] After the series of compare instructions, in step 1708, the processing device executes a branch instruction based on the value of the accumulative flag. At this stage, the accumulative flag reflects whether any one of the first through third compare instructions produced an outcome which satisfies its respective prescribed condition. In this sense, the accumulative flag reflects the logical OR of individual condition flag values produced in preceding comparison steps. This is in marked contrast with the known prior art, where the condition bits strictly reflected the outcome of the single instruction that was last performed.

[1294] If the accumulative flag is set, then the processing device branches to an indicated address (in this case, address D). If the accumulative flag is not set, then the processing device advances to the remainder of the program, generically represented as instructions 1710 in FIG. 90.

[1295] Two examples serve to further clarify the exemplary use of the above-described technique.
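The flow of FIG. 90 can be modeled in a few lines of software. The following is a minimal sketch only: the class and function names are hypothetical, each compare is reduced to a boolean outcome, and the first compare is assumed to overwrite the flag while later compares only set or keep it (the behavior listed for the compare instructions in the examples of this description).

```python
class AccumulativeFlag:
    """Software model of a one-bit accumulative (sticky) flag. Hypothetical helper."""

    def __init__(self):
        self.value = 0

    def overwrite(self, condition):
        # First compare of a sequence: discard any stale value.
        self.value = int(bool(condition))

    def accumulate(self, condition):
        # Later compares: set when the condition holds, otherwise keep the prior value.
        self.value |= int(bool(condition))


def run_fig90(conditions, branch_target="D", fallthrough="1710"):
    """Run a series of compare outcomes, then one branch on the flag."""
    flag = AccumulativeFlag()
    first, *rest = conditions
    flag.overwrite(first)
    for cond in rest:
        flag.accumulate(cond)
    # Single branch: taken if any compare satisfied its condition.
    return branch_target if flag.value else fallthrough


print(run_fig90([False, False, False]))
print(run_fig90([False, True, False]))
```

A single set outcome anywhere in the sequence is enough to take the branch, which is the logical OR behavior described for the accumulative flag.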

[1296] 1) Example A: Error Checking

[1297] The technique shown in FIG. 90 may be applied in numerous applications, such as in performing error checks. As mentioned above, a network processor often performs a series of error checks prior to performing a prescribed main processing task. In the IPv4 packet network protocol, for instance, the network processor checks to determine whether the protocol version of information being processed is equal to 4. The processing device may also determine whether the header of the information being processed is at least five words. The processing device may also determine whether the total length of the packet of information is not greater than the length specified by the MAC layer.

[1298] In contrast to the approach described in FIG. 87, the technique shown in FIG. 90 performs the above-described three comparison operations, followed by a single branch instruction based on the accumulative flag that reflects the accumulative outcome of the three comparison operations. Table 45 illustrates the series of instructions used to perform the error check using the technique of FIG. 90.

TABLE 45
Instruction Index    Action
1                    compare1, overwrite accumulative flag with “not equal” condition
2                    compare2, set accumulative flag if “less equal,” and otherwise maintain accumulative flag if set in prior operation
3                    compare3, set accumulative flag if “greater than,” and otherwise maintain accumulative flag if set in prior operations
4                    branch if accumulative flag is true to error1_or2_or3
5                    additional processing

[1299] The first through third instructions correspond to steps 1702 to 1706, respectively, of FIG. 90. The accumulative outcome of these three compare operations sets the value of the accumulative flag if any of the error conditions reflected in the three comparison operations hold true. The fourth instruction corresponds to step 1708 in FIG. 90. The indicated additional processing in step 5 et seq. corresponds to step 1710 of FIG. 90.
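The difference between the paired approach of Table 44 and the accumulative approach of Table 45 can be illustrated with the IPv4 checks named above. The sketch below is illustrative only: the function names, argument names, and the exact comparison operands (for example, testing the header length against five words) are assumptions chosen for readability, and each function also reports how many branch instructions an error-free packet passes through.

```python
def check_paired(version, header_words, total_len, mac_len):
    """Table 44 style: every compare is followed by its own branch."""
    branches = 0
    branches += 1                      # branch after compare1
    if version != 4:
        return "error1", branches
    branches += 1                      # branch after compare2
    if header_words < 5:
        return "error2", branches
    branches += 1                      # branch after compare3
    if total_len > mac_len:
        return "error3", branches
    return "ok", branches


def check_accumulative(version, header_words, total_len, mac_len):
    """Table 45 style: three compares feed one sticky flag, then one branch."""
    flag = int(version != 4)           # compare1 overwrites the flag
    flag |= int(header_words < 5)      # compare2 sets or keeps the flag
    flag |= int(total_len > mac_len)   # compare3 sets or keeps the flag
    branches = 1                       # single branch on the flag
    return ("error1_or2_or3" if flag else "ok"), branches


print(check_paired(4, 5, 100, 1500))
print(check_accumulative(4, 5, 100, 1500))
```

For an error-free packet the paired version passes through three branches and the accumulative version through one, at the cost of not knowing which check failed until a separate routine examines the packet.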

[1300] A comparison of the technique shown in FIG. 90 with the technique shown in FIG. 87 illustrates the merits of the present invention with respect to the known art. For instance, the technique shown in FIG. 87 uses six instructions to accomplish the error checking operation. In contrast, the technique shown in FIG. 90 uses only four instructions to accomplish the error checking.

[1301] It will be noted that the technique shown in FIG. 90 provides a single branch instruction when any of the extreme error conditions are present, and hence does not provide branching that is specific to individual error conditions. Nevertheless, these extreme error conditions are relatively rare. Thus, it is preferred to streamline the process which checks for these errors by reducing the number of required branching operations. In the relatively rare event that an error condition is encountered, the processing device can then discriminate the exact cause of the failure in a separate routine without presenting a bottleneck situation to normal error-free processing.

[1302] 2) Example B: Logical Operations (e.g., AND and OR operations)

[1303] The technique shown in FIG. 90 also may streamline the execution of various logical operations, such as various operations that involve AND and OR logical operations. Consider, for example, the case where a program requires branching in the event that the following condition (1) is true:

[1304] if (a>=7 AND b<8) then goto label D (1).

[1305] In the known technique, testing this condition would require the execution of multiple pairs of compare and branch instructions. In the present technique, the operation may be performed using a series of compare operations followed by a single branch instruction.

[1306] More specifically, it should first be noted that condition (1) may be rephrased in the negative using OR logic (e.g., the expression c AND d can be expressed as NOT (NOT c OR NOT d)). With this in mind, the condition (1) can be executed by performing the following series of instructions using the accumulative flag:

[1307] cmp.o.lt a, 7

[1308] cmp.ge b, 8

[1309] bc.accumulative0 label D.

[1310] The first instruction commands the processing device to compare operand “a” with the value 7, and then set the accumulative flag if operand “a” is less than 7 (and clear it otherwise). The second instruction commands the processing device to compare operand “b” with the value 8, and then to set the accumulative flag if the operand “b” is greater than or equal to 8. It will be noted that these operations are the opposite of the condition (1) because the instructions are executed using the negative counterpart of this equation. The third instruction commands the processing device to branch to label D if the final value of the accumulative flag is 0.

[1311] The following Truth Table 46 illustrates different scenarios depending on the input values of operands “a” and “b”.

TABLE 46
a>=7    b<8    result    accumulative flag after first compare    accumulative flag after second compare
0       0      0         1                                        1
0       1      0         1                                        1
1       0      0         0                                        1
1       1      1         0                                        0
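The three-instruction sequence for condition (1) can be traced in software to reproduce the rows of Table 46. This is a sketch under stated assumptions: the function name is hypothetical, and the `.o` form is modeled as overwriting the flag while the plain compare only sets or keeps it.

```python
def branch_taken_and(a, b):
    """Trace cmp.o.lt a,7 / cmp.ge b,8 / bc.accumulative0 for condition (1)."""
    flag = int(a < 7)        # cmp.o.lt: overwrite flag with "a < 7"
    after_first = flag
    flag |= int(b >= 8)      # cmp.ge: set flag if "b >= 8", otherwise keep it
    after_second = flag
    taken = flag == 0        # bc.accumulative0: branch when the flag is 0
    return after_first, after_second, taken


# One representative (a, b) pair per truth-table row, in the table's row order.
for a, b in [(6, 8), (6, 7), (7, 8), (7, 7)]:
    print(int(a >= 7), int(b < 8), *branch_taken_and(a, b))
```

The branch is taken only in the last row, where both a>=7 and b<8 hold, which agrees with evaluating the AND condition directly.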

[1312] A similar, but complementary, series of instructions may be used to implement the condition:

[1313] if (a>=7 OR b<8) then goto label D (2).

[1314] Namely, the instructions for implementing this condition are as follows.

[1315] cmp.o.ge a, 7

[1316] cmp.lt b, 8

[1317] bc.accumulative1 label D.

[1318] The first instruction commands the processing device to compare operand “a” with the value 7, and then set the accumulative flag if operand “a” is equal to or greater than 7 (and clear it otherwise). The second instruction commands the processing device to compare operand “b” with the value 8, and then to set the accumulative flag if the operand “b” is less than 8. The third instruction commands the processing device to branch to label D if the final value of the accumulative flag is 1. It will be noted that there is no need to negate the operations described in the above condition, as a logical OR is being performed in this case (rather than an AND operation).

[1319] Finally, the following Truth Table 47 illustrates different scenarios depending on the input values of operands “a” and “b”.

TABLE 47
a>=7    b<8    result    accumulative flag after first compare    accumulative flag after second compare
0       0      0         0                                        0
0       1      1         0                                        1
1       0      1         1                                        1
1       1      1         1                                        1
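The OR sequence for condition (2) can be checked the same way against Table 47. As before, this is only a sketch: the function name is hypothetical and the overwrite versus set-or-keep behavior of the two compare forms is modeled from the description of the instructions.

```python
def branch_taken_or(a, b):
    """Trace cmp.o.ge a,7 / cmp.lt b,8 / bc.accumulative1 for condition (2)."""
    flag = int(a >= 7)       # cmp.o.ge: overwrite flag with "a >= 7"
    after_first = flag
    flag |= int(b < 8)       # cmp.lt: set flag if "b < 8", otherwise keep it
    taken = flag == 1        # bc.accumulative1: branch when the flag is 1
    return after_first, flag, taken


# Exhaustive comparison with the direct boolean expression over a small range.
mismatches = [
    (a, b)
    for a in range(16)
    for b in range(16)
    if branch_taken_or(a, b)[2] != (a >= 7 or b < 8)
]
print(len(mismatches))
```

No negation of the sub-conditions is needed here, since the flag already accumulates a logical OR.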

[1320] In typical processors, many instructions can be predicated (conditional) on any condition code. In the ARM processor, for example, 4 opcode bits are required. However, in one implementation of the present invention, instructions can be predicated using only the sticky bit, requiring only two opcode bits (one bit for conditional/unconditional and one bit for bit 0/bit 1).
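The two-bit predication described above can be sketched as a small decision function. The bit assignment used here (0 meaning unconditional, 1 meaning conditional, and a second bit giving the required value of the sticky bit) is an assumption made for illustration; the description does not specify the encoding.

```python
def should_execute(conditional_bit, sense_bit, sticky_bit):
    """Decide whether a predicated instruction executes.

    conditional_bit: 0 = always execute, 1 = execute conditionally (assumed encoding)
    sense_bit:       value the sticky bit must have for a conditional instruction
    sticky_bit:      current value of the accumulative (sticky) flag
    """
    if conditional_bit == 0:
        return True
    return sticky_bit == sense_bit


# A conditional instruction that requires the sticky bit to be 1.
print(should_execute(1, 1, 0), should_execute(1, 1, 1))
```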

[1321] Although the above-described invention was described in the context of multiple compare instructions followed by a single branch instruction, it has general applicability to other types of processing instructions. Likewise, the present invention can be implemented for any number of compares in combination with any number of AND/OR operations (e.g., (a>7 AND b==8) OR c!=9). Generally, the invention may be applied to the generic case where an accumulative flag is set based on whether either a first or a second instruction satisfies its respective prescribed condition. Then, a third instruction performs some other operation that is conditional on the value of the accumulative flag.

[1322] In accordance with one embodiment of the present invention, a method for executing machine instructions in a processing device is provided. The method comprises the steps of executing a first instruction, identifying whether an outcome of the execution of the first instruction satisfies a first specified condition, and setting an accumulative flag result which reflects whether the first instruction satisfies the first specified condition. The method further comprises the steps of executing at least a second additional instruction, identifying whether an outcome of the execution of the second instruction satisfies a second specified condition, updating the accumulative flag depending on whether either the first instruction or the second instruction satisfies its respective first or second specified condition, and executing a third instruction based on the value of the accumulative flag subsequent to the execution of the first and second instructions. The first and second instructions, in one embodiment, are compare instructions that each compare a first operand with a second operand. The third instruction, in one embodiment, is a branch instruction which bases its branching determination on the value of the accumulative flag. In another embodiment, the first and second instructions are compare instructions that each compare a first operand with a second operand, and the third instruction is a branch instruction which bases its branching determination on the value of the accumulative flag.

[1323] In one embodiment, the compare instructions of the above method determine whether two respective error conditions are present, and the branch instruction bases its branching determination on whether either of the two respective error conditions is present, as reflected by the value of the accumulative flag after the second compare instruction is performed.

[1324] In accordance with another embodiment of the present invention, a computer readable medium containing program code for execution by a processing device is provided. The medium includes a first instruction for performing a first operation, which, when executed by the processing device, generates a first outcome result, at least a second additional instruction for performing a second operation, which, when executed by the processing device, generates a second outcome result, and at least an additional third instruction for performing a third operation based on an accumulative flag, wherein the accumulative flag represents the logical OR of the first and second outcomes. In one embodiment, the first and second instructions are compare instructions that each compare a first operand with a second operand. In another embodiment, the third instruction is a branch instruction which bases its branching determination on the value of the accumulative flag. In yet another embodiment, the first and second instructions are compare instructions that each compare a first operand with a second operand, and the third instruction is a branch instruction which bases its branching determination on the value of the accumulative flag.

[1325] In one embodiment, the compare instructions determine whether two respective error conditions are present, and the branch instruction bases its branching determination on whether either of the two respective error conditions is present, as reflected by the value of the accumulative flag after the second compare instruction is performed.

[1326] In accordance with another embodiment of the present invention, an apparatus for executing machine instructions is provided. The apparatus comprises a storage for storing an accumulative flag, logic for executing instructions and for determining whether the outcomes of the instructions satisfy respective prescribed conditions, logic for setting the accumulative flag to reflect the outcomes of the instructions, wherein the logic for setting the accumulative flag includes logic for determining the value of the accumulative flag based on the logical OR of the outcomes of at least first and second instructions, and wherein the logic for executing instructions also includes logic for executing at least an additional third instruction based on the value of the accumulative flag stored in the storage. In one embodiment, the first and second instructions are compare instructions that each compare a first operand with a second operand. The third instruction can include a branch instruction which bases its branching determination on the value of the accumulative flag. Furthermore, the first and second instructions can include compare instructions that each compare a first operand with a second operand, while the third instruction is a branch instruction which bases its branching determination on the value of the accumulative flag. The compare instructions, in one embodiment, determine whether two respective error conditions are present, and the branch instruction bases its branching determination on whether either of the two respective error conditions is present, as reflected by the value of the accumulative flag after the second compare instruction is performed.

[1327] In accordance with an additional embodiment of the present invention, an apparatus for executing machine instructions is provided. The apparatus comprises a storage for storing an accumulative flag, logic for executing instructions and for determining whether the outcomes of the instructions satisfy respective prescribed conditions, logic for setting the accumulative flag depending on the outcomes of the executed instructions, wherein the logic for setting the accumulative flag includes logic for determining the value of the accumulative flag based on whether at least one instruction within a group of at least two instructions had an outcome which satisfied its respective prescribed condition, and another storage for storing a program that comprises plural instructions, including: a first instruction for performing a first operation, which, when executed by the logic for executing, generates a first outcome result; at least a second additional instruction for performing a second operation, which, when executed by the logic for executing, generates a second outcome result; and at least an additional third instruction for performing a third operation based on the accumulative flag.

[1328] The first and second instructions, in one embodiment, are compare instructions that each compare a first operand with a second operand. The third instruction can include a branch instruction which bases its branching determination on the value of the accumulative flag. Furthermore, the first and second instructions can include compare instructions that each compare a first operand with a second operand while the third instruction includes a branch instruction which bases its branching determination on the value of the accumulative flag.

[1329] In one embodiment, the compare instructions determine whether two respective error conditions are present, and the branch instruction bases its branching determination on whether either of the two respective error conditions is present, as reflected by the value of the accumulative flag after the second compare instruction is performed.

[1330] While the foregoing description includes many details and specificities, it is to be understood that these have been included for purposes of explanation only, and are not to be interpreted as limitations of the present invention. Many modifications to the embodiments described above can be made without departing from the spirit and scope of the invention.

What is claimed is:
 1. A network processor implemented on a chip, comprising: means for processing a plurality of protocols including ATM, frame relay, Ethernet, and IP; said means being programmable using a set of library commands to process additional protocols; wherein said means comprises an arithmetic logic unit (ALU), a load/store unit (LSU), a preload/bump unit (PBU), a register file unit (RFU), an agent interface, and an internal memory.
 2. The network processor of claim 1, further comprising a fetch unit and a program sequencer.
 3. The network processor of claim 1, wherein the ALU performs arithmetic and logic operations on data operands.
 4. The network processor of claim 1, wherein the LSU performs address calculations in order to address data operands in the internal memory.
 5. The network processor of claim 4, wherein the LSU calculates an effective address according to one of five available options, including: (1) effective address is the content of a register from the RFU; (2) effective address is the sum of content of a first register from the RFU and content of a second register from the RFU; (3) effective address is the sum of content of a first register from the RFU and content of a second register from the RFU after the second register is shifted by a specified number of bits; (4) effective address is the sum of the content of a register from the RFU and a displacement that occupies a specified number of bits in an instruction word; and (5) effective address is an absolute address included in the instruction word.
 6. The network processor of claim 1, wherein the RFU comprises a first register file for a current task and a second register file for preloading register values for a next task.
 7. The network processor of claim 6, wherein data is read from or written to the first register file based on a comparison between a current task ID and a task ID associated with the first register file.
 8. The network processor of claim 6, wherein the RFU comprises a third register file for storing register values for the current task that are not stored in the first register file.
 9. The network processor of claim 8, wherein data is read from or written to the third register file when the current task ID and the task ID associated with the first register file are not the same.
 10. The network processor of claim 2, wherein the PSU performs decoding of instructions received from the internal memory.
 11. The network processor of claim 9, wherein the fetch unit controls what instructions are fetched from memory for decoding by the PSU.
 12. The network processor of claim 8, wherein a task switch is performed by making the next task the current task and preloading a further next task.
 13. The network processor of claim 12, wherein performance of a task switch includes treating the second register file as the third register file after the task switch.
 14. The network processor of claim 1, wherein the agent interface allows the network processor to interface to external modules for executing instructions.
 15. The network processor of claim 14, wherein the external modules include one or more of a CRC module, encryption module, hashing module, and table lookup module.
 16. The network processor of claim 1, wherein the internal memory is for storing program information and data.
 17. A communications processor implemented on a chip, comprising: a network processor including means for processing a plurality of protocols including ATM, frame relay, Ethernet, and IP, said means being programmable using a set of library commands to process additional protocols, wherein said means comprises an arithmetic logic unit (ALU), a load/store unit (LSU), a preload/bump unit (PBU), a register file unit (RFU), an agent interface, and an internal memory; a protocol processor for controlling the network processor; wherein the protocol processor performs control plane processing and the network processor performs data plane processing.
 18. The communications processor of claim 17, wherein the network processor processes instructions by performing a fetch, decode, address, execute, and a write.
 19. The communications processor of claim 17, wherein the network processor and the protocol processor are ring members on a ring network, and further comprising a plurality of other ring members on the ring network.
 20. The communications processor of claim 19, wherein the network processor includes a plurality of compounds that share a single ring interface to the ring network.
 21. The communications processor of claim 20, wherein the compounds include a doorbell agent for controlling the execution sequence of tasks for the network processor.
 22. The communications processor of claim 20, wherein the compounds include a multireader agent for servicing requests to read data from the internal memory.
 23. The communications processor of claim 20, wherein the compounds include a message sender agent for sending messages onto the ring network.
 24. The communications processor of claim 20, wherein the compounds include a DMA agent for sending messages to initiate a DMA controller on the ring network.
 25. The communications processor of claim 20, wherein the compounds include a CRC agent for performing CRC calculations.
 26. The communications processor of claim 20, wherein the compounds include a debug module.